Prostate cancer gene

ABSTRACT

The present invention relates to PG1, a gene associated with prostate cancer. The invention provides polynucleotides including biallelic markers derived from PG1 and from flanking genomic regions. Primers hybridizing to these biallelic markers and to the regions flanking them are also provided. This invention provides polynucleotides and methods suitable for genotyping a nucleic acid-containing sample for one or more biallelic markers of the invention. Further, the invention provides methods to detect a statistical correlation between a biallelic marker allele and prostate cancer and between a haplotype and prostate cancer. The invention also relates to diagnostic methods of determining whether an individual is at risk for developing prostate cancer, and whether an individual suffers from prostate cancer as a result of a mutation in the PG1 gene.

RELATED APPLICATION DATA

[0001] This application is a divisional of U.S. patent application Ser. No. 09/338,907, filed Jun. 23, 1999, which is a continuation-in-part of U.S. patent application Ser. No. 09/218,207, filed Dec. 22, 1998, which is a continuation-in-part of U.S. patent application Ser. No. 08/996,306, filed Dec. 22, 1997, and claims priority from U.S. Provisional Patent Application Ser. No. 60/099,658, filed Sep. 9, 1998, the disclosures of which are all incorporated herein by reference in their entireties.

BACKGROUND OF THE INVENTION

[0002] A cancer is a clonal proliferation of cells produced as a consequence of cumulative genetic damage that finally results in unrestrained cell growth, tissue invasion and metastasis (cell transformation). Regardless of the type of cancer, transformed cells carry damaged DNA in many forms: as gross chromosomal translocations or, more subtly, as DNA amplification, rearrangement or even point mutations.

[0003] Some oncogenic mutations may be inherited in the germline, thus predisposing the mutation carrier to an increased risk of cancer. However, in a majority of cases, cancer does not occur as a simple monogenic disease with clear Mendelian inheritance. There is only a two- or threefold increased risk of cancer among first-degree relatives for many cancers (Mulvihill J J, Miller R W & Fraumeni J F, 1977, Genetics of human cancer Vol 3, New York Raven Press). Alternatively, DNA damage may be acquired somatically, probably induced by exposure to environmental carcinogens. Somatic mutations are generally responsible for the vast majority of cancer cases.

[0004] Studies of the age dependence of cancer have suggested that several successive mutations are needed to convert a normal cell into an invasive carcinoma. Since human mutation rates are typically 10⁻⁶/gene/cell, the chance of a single cell undergoing many independent mutations is very low (Loeb L A, Cancer Res 1991, 51: 3075-3079). Cancer nevertheless happens because of a combination of two mechanisms. Some mutations enhance cell proliferation, increasing the target population of cells for the next mutation. Other mutations affect the stability of the entire genome, increasing the overall mutation rate, as in the case of mismatch repair proteins (reviewed in Arnheim N & Shibata D, Curr. Op. Genetics & Development, 1997, 7:364-370).
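The scale of this argument can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the required mutations are independent events each occurring at the quoted rate of 10⁻⁶ per gene per cell; the function name and the mutation counts tried are hypothetical and are not taken from the cited work.

```python
def chance_of_independent_mutations(rate_per_gene_per_cell: float, n_mutations: int) -> float:
    """Probability that a single cell carries n specific, independent mutations.

    Assumption (illustrative only): every mutation occurs independently at the
    same per-gene, per-cell rate, so the joint probability is rate ** n.
    """
    if n_mutations < 0:
        raise ValueError("n_mutations must be non-negative")
    return rate_per_gene_per_cell ** n_mutations


if __name__ == "__main__":
    # Hypothetical numbers of successive mutations, chosen only to show the trend.
    for n in (1, 2, 3, 6):
        print(n, chance_of_independent_mutations(1e-6, n))
```

Under these assumptions the probability falls by six orders of magnitude with each additional mutation, which is why mechanisms that enlarge the target cell population or raise the mutation rate matter.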

[0005] An intricate process known as the cell cycle drives normal proliferation of cells in an organism. Regulation of the extent of cell cycle activity and the orderly execution of sequential steps within the cycle ensure the normal development and homeostasis of the organism. Conversely, many of the properties of cancer cells (uncontrolled proliferation, increased mutation rate, abnormal translocations and gene amplifications) can be attributed directly to perturbations of the normal regulation or progression of the cycle. In fact, many of the genes that have been identified over the past several decades as being involved in cancer can now be appreciated in terms of their direct or indirect role in either regulating entry into the cell cycle or coordinating events within the cell cycle.

[0006] Recent studies have identified three groups of genes which are frequently mutated in cancer. The first group of genes, called oncogenes, are genes whose products activate cell proliferation. The normal non-mutant versions are called proto-oncogenes. The mutated forms are excessively or inappropriately active in promoting cell proliferation, and act in the cell in a dominant way in that a single mutant allele is enough to affect the cell phenotype. Activated oncogenes are rarely transmitted as germline mutations since they would probably be lethal when expressed in all the cells. Therefore oncogenes can only be investigated in tumor tissues.

[0007] Oncogenes and proto-oncogenes can be classified into several different categories according to their function. This classification includes genes that code for proteins involved in signal transduction such as: growth factors (e.g., sis, int-2); receptor and non-receptor protein-tyrosine kinases (e.g., erbB, src, bcr-abl, met, trk); membrane-associated G proteins (e.g., ras); cytoplasmic protein kinases (e.g., the mitogen-activated protein kinase (MAPK) family, raf, mos, pak); or nuclear transcription factors (e.g., myc, myb, fos, jun, rel) (for review see Hunter T, 1991 Cell 64:249; Fanger G R et al., 1997 Curr. Op. Genet. Dev. 7:67-74; Weiss F U et al., ibid. 80-86).

[0008] The second group of genes which are frequently mutated in cancer, called tumor suppressor genes, are genes whose products inhibit cell growth. Mutant versions in cancer cells have lost their normal function, and act in the cell in a recessive way in that both copies of the gene must be inactivated in order to change the cell phenotype. Most importantly, the tumor phenotype can be rescued by the wild type allele, as shown by cell fusion experiments first described by Harris and colleagues (Harris H et al., 1969, Nature 223:363-368). Germline mutations of tumor suppressor genes may be transmitted and thus studied in both constitutional and tumor DNA from familial or sporadic cases. The current family of tumor suppressors includes DNA-binding transcription factors (e.g., p53, WT1), transcription regulators (e.g., RB, APC, probably BRCA1), and protein kinase inhibitors (e.g., p16), among others (for review, see Haber D & Harlow E, 1997, Nature Genet. 16:320-322).

[0009] The third group of genes which are frequently mutated in cancer, called mutator genes, are responsible for maintaining genome integrity and/or low mutation rates. Loss of function of both alleles increases cell mutation rates, and as a consequence, proto-oncogenes and tumor suppressor genes may be mutated. Mutator genes can also be classified as tumor suppressor genes, except for the fact that tumorigenesis caused by this class of genes cannot be suppressed simply by restoration of a wild-type allele, as described above. Genes whose inactivation may lead to a mutator phenotype include mismatch repair genes (e.g., MLH1, MSH2), DNA helicases (e.g., BLM, WRN) or other genes involved in DNA repair and genomic stability (e.g., p53, possibly BRCA1 and BRCA2) (for review see Haber D & Harlow E, 1997, Nature Genet. 16:320-322; Fishel R & Wilson T, 1997, Curr. Op. Genet. Dev. 7:105-113; Ellis N A, 1997 ibid. 354-363).

[0010] The recent development of sophisticated techniques for genetic mapping has resulted in an ever expanding list of genes associated with particular types of human cancers. The human haploid genome contains an estimated 80,000 to 100,000 genes scattered on a 3×10⁹ base-long double-stranded DNA. Each human being is diploid, i.e., possesses two haploid genomes, one from paternal origin, the other from maternal origin. The sequence of a given genetic locus may vary between individuals in a population or between the two copies of the locus on the chromosomes of a single individual. Genetic mapping techniques often exploit these differences, which are called polymorphisms, to map the location of genes associated with human phenotypes.

[0011] One mapping technique, called the loss of heterozygosity (LOH) technique, is often employed to detect genes in which a loss of function results in a cancer, such as the tumor suppressor genes described above. Tumor suppressor genes often produce cancer via a two hit mechanism in which a first mutation, such as a point mutation (or a small deletion or insertion), inactivates one allele of the tumor suppressor gene. Often, this first mutation is inherited from generation to generation.

[0012] A second mutation, often a spontaneous somatic mutation such as a deletion which deletes all or part of the chromosome carrying the other copy of the tumor suppressor gene, results in a cell in which both copies of the tumor suppressor gene are inactive.

[0013] As a consequence of the deletion in the tumor suppressor gene, one allele is lost for any genetic marker located close to the tumor suppressor gene. Thus, if the patient is heterozygous for a marker, the tumor tissue loses heterozygosity, becoming homozygous or hemizygous. This loss of heterozygosity generally provides strong evidence for the existence of a tumor suppressor gene in the lost region.

[0014] By genotyping pairs of blood and tumor samples from affected individuals with a set of highly polymorphic genetic markers, such as microsatellites, covering the whole genome, one can discover candidate locations for tumor suppressor genes. Due to the presence of contaminant non-tumor tissue in most pathological tumor samples, a decreased relative intensity rather than total loss of heterozygosity of informative microsatellites is observed in the tumor samples. Therefore, classic LOH analysis generally requires quantitative PCR analysis, often limiting the power of detection of this technique. Another limitation of LOH studies resides in the fact that they only allow the definition of rather large candidate regions, typically spanning several megabases. Refinement of such candidate regions requires the definition of the minimally overlapping portion of LOH regions identified in tumor tissues from several hundred affected patients.
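A partial loss of this kind is often summarized as an allelic imbalance ratio that compares the two allele signals in tumor DNA with those in matched blood DNA. The sketch below is an illustrative assumption, not the method of this disclosure: the function names, the peak-intensity inputs, and the 0.5 cutoff are hypothetical, and real studies calibrate their own thresholds.

```python
def allelic_imbalance_ratio(normal_a1: float, normal_a2: float,
                            tumor_a1: float, tumor_a2: float) -> float:
    """Return (T1/T2)/(N1/N2), folded so that the result lies in [0, 1].

    Inputs are quantitative signals (for example, PCR peak intensities) for
    the two alleles of a marker that is heterozygous in normal tissue.
    A value near 1 means no imbalance; a value near 0 means near-total loss.
    """
    if normal_a1 <= 0 or normal_a2 <= 0:
        raise ValueError("marker must be heterozygous (both signals > 0) in normal DNA")
    if tumor_a1 < 0 or tumor_a2 < 0 or (tumor_a1 == 0 and tumor_a2 == 0):
        raise ValueError("tumor signals must be non-negative and not both zero")
    if tumor_a1 == 0 or tumor_a2 == 0:
        return 0.0  # one allele completely absent from the tumor sample
    ratio = (tumor_a1 / tumor_a2) / (normal_a1 / normal_a2)
    return ratio if ratio <= 1.0 else 1.0 / ratio


def shows_loh(normal_a1: float, normal_a2: float,
              tumor_a1: float, tumor_a2: float, cutoff: float = 0.5) -> bool:
    """Flag LOH when the folded ratio drops below a hypothetical cutoff."""
    return allelic_imbalance_ratio(normal_a1, normal_a2, tumor_a1, tumor_a2) < cutoff


if __name__ == "__main__":
    # Invented intensities: the second allele is reduced but not absent in the
    # tumor, as expected when the sample contains contaminating normal tissue.
    print(allelic_imbalance_ratio(100, 100, 100, 40))
    print(shows_loh(100, 100, 100, 40))
```

Folding the ratio into the range 0 to 1 makes the result independent of which allele is labeled first.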

[0015] Another approach to genetic mapping, called linkage analysis, is based upon establishing a correlation between the transmission of genetic markers and that of a specific trait throughout generations within a family. In this approach, all members of a series of affected families are genotyped with a few hundred markers, typically microsatellite markers, which are distributed at an average density of one every 10 Mb. By comparing genotypes in all family members, one can attribute sets of alleles to parental haploid genomes (haplotyping or phase determination). The origin of recombined fragments is then determined in the offspring of all families. Those that co-segregate with the trait are tracked. After pooling data from all families, statistical methods are used to determine the likelihood that the marker and the trait are segregating independently in all families. As a result of the statistical analysis, one or several regions are selected as candidates, based on their high probability to carry a trait causing allele. The result of linkage analysis is considered significant when the chance of independent segregation is lower than 1 in 1000 (expressed as a LOD score > 3). Identification of recombinant individuals using additional markers allows further delineation of the candidate linked region, which most usually ranges from 2 to 20 Mb.
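The LOD score mentioned above is the base-10 logarithm of the odds of linkage versus independent segregation, so a score of 3 corresponds to odds of 1000 to 1. The following is a minimal sketch for the simplest textbook case, assuming phase-known meioses that can each be scored as recombinant or non-recombinant; the function name and the example counts are hypothetical, and real pedigree analyses use more elaborate likelihoods.

```python
import math


def lod_score(n_meioses: int, n_recombinants: int, theta: float) -> float:
    """LOD score for phase-known meioses at recombination fraction theta.

    Likelihood under linkage: theta**k * (1 - theta)**(n - k)
    Likelihood under independent segregation (theta = 0.5): 0.5**n
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= n_recombinants <= n_meioses:
        raise ValueError("n_recombinants must lie between 0 and n_meioses")
    linked = (theta ** n_recombinants) * ((1.0 - theta) ** (n_meioses - n_recombinants))
    unlinked = 0.5 ** n_meioses
    if linked == 0.0:
        return float("-inf")  # the observed data are impossible at this theta
    return math.log10(linked / unlinked)


if __name__ == "__main__":
    # Hypothetical data: 10 informative meioses with no recombinants.
    print(round(lod_score(10, 0, 0.0), 2))
    # The same number of meioses with one recombinant, evaluated at theta = 0.1.
    print(round(lod_score(10, 1, 0.1), 2))
```

In this simplified setting, ten informative meioses without a recombinant are just enough to pass the threshold of 3, which illustrates why data are pooled across many families.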

[0016] Linkage analysis studies have generally relied on the use of microsatellite markers (also called simple tandem repeat polymorphisms, or simple sequence length polymorphisms). These include small arrays of tandem repeats of simple sequences (di-, tri-, and tetranucleotide repeats), which exhibit a high degree of length polymorphism, and thus a high level of informativeness. To date, just over 5,000 microsatellites have been ordered along the human genome (Dib et al., Nature 1996, 380: 152), thus limiting the maximum attainable resolution of linkage analysis to ca. 600 kb on average.
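The 600 kb figure follows from dividing the genome length by the number of ordered markers. The short sketch below reproduces that arithmetic under the assumption that markers are evenly spaced along a 3×10⁹ base haploid genome; the function name is hypothetical.

```python
def average_marker_spacing_kb(genome_length_bp: float, n_markers: int) -> float:
    """Average distance between adjacent markers, in kilobases.

    Assumes, for illustration, that markers are spread evenly over the genome.
    """
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return genome_length_bp / n_markers / 1000.0


if __name__ == "__main__":
    # 3 x 10^9 bases and about 5,000 ordered microsatellites, as stated in the text.
    print(average_marker_spacing_kb(3e9, 5000))
```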

[0017] Linkage analysis has been successfully applied to map simple genetic traits that show clear Mendelian inheritance patterns. About 100 pathological trait-causing genes were discovered by linkage analysis over the last 10 years.

[0018] However, linkage analysis approaches have proven difficult for complex genetic traits, those probably due to the combined action of multiple genes and/or environmental factors. In such cases, too large an effort and cost are needed to recruit the adequate number of affected families required for applying linkage analysis to these situations, as recently discussed by Risch, N. and Merikangas, K. (Science 1996, 273: 1516-1517). Finally, linkage analysis cannot be applied to the study of traits for which no large informative families are available. Typically, this will be the case in any attempt to identify trait-causing alleles involved in sporadic cases.

[0019] The incidence of prostate cancer has dramatically increased over the last decades. It averages 30-50/100,000 males both in Western European countries and within the US White male population. In these countries, it has recently become the most commonly diagnosed malignancy, being one of every four cancers diagnosed in American males. Prostate cancer's incidence is very much population specific, since it varies from 2/100,000 in China to over 80/100,000 among African-American males.

[0020] In France, the incidence of prostate cancer is 35/100,000 males and it is increasing by 10/100,000 per decade. Mortality due to prostate cancer is also growing accordingly. It is the second cause of cancer death among French males, and the first one among French males aged over 70. This makes prostate cancer a serious burden in terms of public health, especially in view of the aging of populations.

[0021] An average 40% reduction in life expectancy affects males with prostate cancer. If completely localized, prostate cancer can be cured by surgery, although with an average success rate of only ca. 50%. If diagnosed after metastasis from the prostate, prostate cancer is a fatal disease for which there is no curative treatment.

[0022] Early-stage diagnosis relies on Prostate Specific Antigen (PSA) assays, which would allow the detection of prostate cancer seven years before clinical symptoms become apparent. The effectiveness of PSA-based diagnosis is however limited, due to its inability to discriminate between malignant and non-malignant conditions of the organ.

[0023] Therefore, there is a strong need for both a reliable diagnostic procedure which would enable early-stage prostate cancer prognosis, and for preventive and curative treatments of the disease. The present invention relates to the PG1 gene, a gene associated with prostate cancer, as well as diagnostic methods and reagents for detecting alleles of the gene which may cause prostate cancer, and therapies for treating prostate cancer.

SUMMARY OF THE INVENTION

[0024] The present invention relates to the identification of a gene associated with prostate cancer, identified as the PG1 gene, and reagents, diagnostics, and therapies related thereto. The present invention is also based on the discovery of a novel set of PG1-related biallelic markers. See the definition of PG1-related biallelic markers in the Detailed Description Section. These markers are located in the coding regions as well as non-coding regions adjacent to the PG1 gene. The position of these markers and knowledge of the surrounding sequence has been used to design polynucleotide compositions which are useful in determining the identity of nucleotides at the marker position, as well as more complex association and haplotyping studies which are useful in determining the genetic basis for diseases including cancer and prostate cancer. In addition, the compositions and methods of the invention find use in the identification of the targets for the development of pharmaceutical agents and diagnostic methods, as well as the characterization of the differential efficacious responses to and side effects from pharmaceutical agents acting on diseases including cancer and prostate cancer.

[0025] A first embodiment of the invention is a recombinant, purified or isolated polynucleotide comprising, or consisting of, a mammalian genomic sequence, gene, or fragments thereof. In one aspect the sequence is derived from a human, mouse or other mammal. In a preferred aspect, the genomic sequence is the human genomic sequence of SEQ ID NO: 179 or the complement thereto. In a second preferred aspect, the genomic sequence is selected from one of the two mouse genomic fragments of SEQ ID NO: 182 and 183. In yet another aspect of this embodiment, the nucleic acid comprises nucleotides 1629 through 1870 of the sequence of SEQ ID NO: 179. Optionally, said polynucleotide consists of, consists essentially of, or comprises a contiguous span of nucleotides of a mammalian genomic sequence, preferably a sequence selected from the following SEQ ID NOs: 179, 182, and 183, wherein said contiguous span is at least 6, 8, 10, 12, 15, 20, 25, 30, 50, 100, 200, or 500 nucleotides in length.

[0026] A second embodiment of the present invention is a recombinant, purified or isolated polynucleotide comprising, or consisting of, a mammalian cDNA sequence, or fragments thereof. In one aspect the sequence is derived from a human, mouse or other mammal. In a preferred aspect, the cDNA sequence is selected from the human cDNA sequences of SEQ ID NO: 3, 69, 112-124 or the complement thereto. In a second preferred aspect, the cDNA sequence is the mouse cDNA sequence of SEQ ID NO: 184. Optionally, said polynucleotide consists of, consists essentially of, or comprises a contiguous span of nucleotides of a mammalian genomic sequence, preferably a sequence selected from the following SEQ ID NOs: 3, 69, 112-124 and 184, wherein said contiguous span is at least 6, 8, 10, 12, 15, 20, 25, 30, 50, 100, 200, or 500 nucleotides in length.

[0027] A third embodiment of the present invention is a recombinant, purified or isolated polynucleotide, or the complement thereof, encoding a mammalian PG1 protein, or a fragment thereof. In one aspect the PG1 protein sequence is from a human, mouse or other mammal. In a preferred aspect, the PG1 protein sequence is selected from the human PG1 protein sequences of SEQ ID NO: 4, 5, 70, and 125-136. In a second preferred aspect, the PG1 protein sequence is the mouse PG1 protein sequence of SEQ ID NO: 74. Optionally, said fragment of PG1 polypeptide consists of, consists essentially of, or comprises a contiguous stretch of at least 8, 10, 12, 15, 20, 25, 30, 50, 100 or 200 amino acids from SEQ ID NOs: 4, 5, 70, 74, and 125-136, as well as any other human, mouse or mammalian PG1 polypeptide.

[0028] A fourth embodiment of the invention comprises the polynucleotide primers and probes disclosed herein.

[0029] A fifth embodiment of the present invention is a recombinant, purified or isolated polypeptide comprising or consisting of a mammalian PG1 protein, or a fragment thereof. In one aspect the PG1 protein sequence is from a human, mouse or other mammal. In a preferred aspect, the PG1 protein sequence is selected from the human PG1 protein sequences of SEQ ID NO: 4, 5, 70, and 125-136. In a second preferred aspect, the PG1 protein sequence is the mouse PG1 protein sequence of SEQ ID NO: 74. Optionally, said fragment of PG1 polypeptide consists of, consists essentially of, or comprises a contiguous stretch of at least 8, 10, 12, 15, 20, 25, 30, 50, 100 or 200 amino acids from SEQ ID NOs: 4, 5, 70, 74, and 125-136, as well as any other human, mouse or mammalian PG1 polypeptide.

[0030] A sixth embodiment of the present invention is an antibody composition capable of specifically binding to a polypeptide of the invention. Optionally, said antibody is polyclonal or monoclonal. Optionally, said polypeptide is an epitope-containing fragment of at least 8, 10, 12, 15, 20, 25, or 30 amino acids of a human, mouse, or mammalian PG1 protein, preferably a sequence selected from SEQ ID NOs: 4, 5, 70, 74, or 125-136.

[0031] A seventh embodiment of the present invention is a vector comprising any polynucleotide of the invention. Optionally, said vector is an expression vector, gene therapy vector, amplification vector, gene targeting vector, or knock-out vector.

[0032] An eighth embodiment of the present invention is a host cell comprising any vector of the invention.

[0033] A ninth embodiment of the present invention is a mammalian host cell comprising a PG1 gene disrupted by homologous recombination with a knock-out vector.

[0034] A tenth embodiment of the present invention is a nonhuman host mammal or animal comprising a vector of the invention.

[0035] A further embodiment of the present invention is a nonhuman host mammal comprising a PG1 gene disrupted by homologous recombination with a knock-out vector.

[0036] Another embodiment of the present invention is a method of determining whether an individual is at risk of developing cancer or prostate cancer at a later date or whether the individual suffers from cancer or prostate cancer as a result of a mutation in the PG1 gene, comprising obtaining a nucleic acid sample from the individual; and determining whether the nucleotides present at one or more of the PG1-related biallelic markers of the invention are indicative of a risk of developing prostate cancer at a later date or indicative of prostate cancer resulting from a mutation in the PG1 gene. Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66.

[0037] Another embodiment of the present invention is a method of determining whether an individual is at risk of developing prostate cancer at a later date or whether the individual suffers from prostate cancer as a result of a mutation in the PG1 gene, comprising obtaining a nucleic acid sample from the individual and determining the nucleotides present at one or more of the polymorphic bases in a PG1-related biallelic marker. Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66.

[0038] Another embodiment of the present invention is a method of obtaining an allele of the PG1 gene which is associated with a detectable phenotype, comprising obtaining a nucleic acid sample from an individual expressing the detectable phenotype, contacting the nucleic acid sample with an agent capable of specifically detecting a nucleic acid encoding the PG1 protein, and isolating the nucleic acid encoding the PG1 protein. In one aspect of this method, the contacting step comprises contacting the nucleic acid sample with at least one nucleic acid probe capable of specifically hybridizing to said nucleic acid encoding the PG1 protein. In another aspect of this embodiment, the contacting step comprises contacting the nucleic acid sample with an antibody capable of specifically binding to the PG1 protein. In another aspect of this embodiment, the step of obtaining a nucleic acid sample from an individual expressing a detectable phenotype comprises obtaining a nucleic acid sample from an individual suffering from prostate cancer.

[0039] Another embodiment of the present invention is a method of obtaining an allele of the PG1 gene which is associated with a detectable phenotype, comprising obtaining a nucleic acid sample from an individual expressing the detectable phenotype, contacting the nucleic acid sample with an agent capable of specifically detecting a sequence within the 8p23 region of the human genome, identifying a nucleic acid encoding the PG1 protein in the nucleic acid sample, and isolating the nucleic acid encoding the PG1 protein. In one aspect of this embodiment, the nucleic acid sample is obtained from an individual suffering from cancer or prostate cancer.

[0040] Another embodiment of the present invention is a method of categorizing the risk of prostate cancer in an individual comprising the step of assaying a sample taken from the individual to determine whether the individual carries an allelic variant of PG1 associated with an increased risk of prostate cancer. In one aspect of this embodiment, the sample is a nucleic acid sample. In another aspect a nucleic acid sample is assayed by determining the frequency of the PG1 transcripts present. In another aspect of this embodiment, the sample is a protein sample. In another aspect of this embodiment, the method further comprises determining whether the PG1 protein in the sample binds an antibody specific for a PG1 isoform associated with prostate cancer.

[0041] Another embodiment of the present invention is a method of categorizing the risk of prostate cancer in an individual comprising the step of determining whether the identities of the polymorphic bases of one or more biallelic markers which are in linkage disequilibrium with the PG1 gene are indicative of an increased risk of prostate cancer.

[0042] Another embodiment of the present invention comprises a method of identifying molecules which specifically bind to a PG1 protein, preferably the protein of SEQ ID NO: 4 or a portion thereof, comprising the steps of introducing a nucleic acid encoding the protein of SEQ ID NO: 4 or a portion thereof into a cell such that the protein of SEQ ID NO: 4 or a portion thereof contacts proteins expressed in the cell and identifying those proteins expressed in the cell which specifically interact with the protein of SEQ ID NO: 4 or a portion thereof.

[0043] Another embodiment of the present invention is a method of identifying molecules which specifically bind to the protein of SEQ ID NO: 4 or a portion thereof. One step of the method comprises linking a first nucleic acid encoding the protein of SEQ ID NO: 4 or a portion thereof to a first indicator nucleic acid encoding a first indicator polypeptide to generate a first chimeric nucleic acid encoding a first fusion protein. The first fusion protein comprises the protein of SEQ ID NO: 4 or a portion thereof and the first indicator polypeptide. Another step of the method comprises linking a second nucleic acid encoding a test polypeptide to a second indicator nucleic acid encoding a second indicator polypeptide to generate a second chimeric nucleic acid encoding a second fusion protein. The second fusion protein comprises the test polypeptide and the second indicator polypeptide. Association between the first indicator protein and the second indicator protein produces a detectable result. Another step of the method comprises introducing the first chimeric nucleic acid and the second chimeric nucleic acid into a cell. Another step comprises detecting the detectable result.

[0044] A further embodiment of the invention is a purified or isolated mammalian PG1 gene or cDNA sequence.

[0045] Further embodiments of the present invention include the nucleic acid and amino acid sequences of mutant or low frequency PG1 alleles derived from prostate cancer patients, tissues or cell lines. The present invention also encompasses methods which utilize detection of these mutant PG1 sequences in an individual or tissue sample to diagnose prostate cancer, assess the risk of developing prostate cancer or assess the likely severity of a particular prostate tumor.

[0046] Another embodiment of the invention encompasses any polynucleotide of the invention attached to a solid support. In addition, the polynucleotides of the invention which are attached to a solid support encompass polynucleotides with any further limitation described in this disclosure, or those following: Optionally, said polynucleotides may be specified as attached individually or in groups of at least 2, 5, 8, 10, 12, 15, 20, or 25 distinct polynucleotides of the invention to a single solid support. Optionally, polynucleotides other than those of the invention may be attached to the same solid support as polynucleotides of the invention. Optionally, when multiple polynucleotides are attached to a solid support they are attached at random locations, or in an ordered array. Optionally, said ordered array is addressable.

[0047] An additional embodiment of the invention encompasses the use of any polynucleotide for, or any polynucleotide for use in, determining the identity of an allele at a PG1-related biallelic marker. In addition, the polynucleotides of the invention for use in determining the identity of an allele at a PG1-related biallelic marker encompass polynucleotides with any further limitation described in this disclosure, or those following: Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66. Optionally, said polynucleotide may comprise a sequence disclosed in the present specification. Optionally, said polynucleotide may consist of, or consist essentially of, any polynucleotide described in the present specification. Optionally, said determining is performed in a hybridization assay, sequencing assay, microsequencing assay, or allele-specific amplification assay. Optionally, said polynucleotide is attached to a solid support, array, or addressable array. Optionally, said polynucleotide is labeled.

[0048] Another embodiment of the invention encompasses the use of any polynucleotide for, or any polynucleotide for use in, amplifying a segment of nucleotides comprising a PG1-related biallelic marker. In addition, the polynucleotides of the invention for use in amplifying a segment of nucleotides comprising a PG1-related biallelic marker encompass polynucleotides with any further limitation described in this disclosure, or those following: Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66. Optionally, said polynucleotide may comprise a sequence disclosed in the present specification. Optionally, said polynucleotide may consist of, or consist essentially of, any polynucleotide described in the present specification. Optionally, said amplifying is performed by PCR or LCR. Optionally, said polynucleotide is attached to a solid support, array, or addressable array. Optionally, said polynucleotide is labeled.

[0049] A further embodiment of the invention encompasses methods of genotyping a biological sample comprising determining the identity of an allele at a PG1-related biallelic marker. In addition, the genotyping methods of the invention encompass methods with any further limitation described in this disclosure, or those following: Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66. Optionally, said method further comprises determining the identity of a second allele at said biallelic marker, wherein said first allele and second allele are not base paired (by Watson & Crick base pairing) to one another. Optionally, said biological sample is derived from a single individual or subject. Optionally, said method is performed in vitro. Optionally, said biallelic marker is determined for both copies of said biallelic marker present in said individual's genome. Optionally, said biological sample is derived from multiple subjects or individuals. Optionally, said method further comprises amplifying a portion of said sequence comprising the biallelic marker prior to said determining step. Optionally, wherein said amplifying is performed by PCR, LCR, or replication of a recombinant vector comprising an origin of replication and said portion in a host cell. Optionally, wherein said determining is performed by a hybridization assay, sequencing assay, microsequencing assay, or allele-specific amplification assay.

[0050] An additional embodiment of the invention comprises methods of estimating the frequency of an allele in a population comprising determining the proportional representation of an allele at a PG1-related biallelic marker in said population. In addition, the methods of estimating the frequency of an allele in a population of the invention encompass methods with any further limitation described in this disclosure, or those following: Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66. Optionally, determining the proportional representation of an allele at a PG1-related biallelic marker is accomplished by determining the identity of the alleles for both copies of said biallelic marker present in the genome of each individual in said population and calculating the proportional representation of said allele at said PG1-related biallelic marker for the population. Optionally, determining the proportional representation is accomplished by performing a genotyping method of the invention on a pooled biological sample derived from a representative number of individuals, or each individual, in said population, and calculating the proportional amount of said nucleotide compared with the total.
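By way of illustration only, the per-individual calculation described above may be sketched as follows. The sketch assumes that each individual's genotype is recorded as a pair of allele symbols, one for each copy of the marker; the function name `allele_frequencies` and the example genotypes are hypothetical and are not taken from the disclosure.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Estimate allele frequencies at one biallelic marker.

    genotypes: iterable of 2-tuples, one per individual, giving the
    allele observed on each of the two copies of the marker.
    Returns a dict mapping each allele to its proportional representation.
    """
    counts = Counter()
    for first, second in genotypes:
        counts[first] += 1
        counts[second] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes for four individuals at a G/C marker.
sample = [("G", "G"), ("G", "C"), ("G", "G"), ("C", "C")]
freqs = allele_frequencies(sample)
```

Counting both copies per individual gives eight chromosomes in this example, of which five carry G and three carry C. A pooled-sample measurement would replace the counting step with a measured signal ratio, which this sketch does not attempt to model.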

[0051] A further embodiment of the invention comprises methods of detecting an association between a genotype and a phenotype, comprising the steps of a) genotyping at least one PG1-related biallelic marker in a trait positive population according to a genotyping method of the invention; b) genotyping said PG1-related biallelic marker in a control population according to a genotyping method of the invention; and c) determining whether a statistically significant association exists between said genotype and said phenotype. In addition, the methods of detecting an association between a genotype and a phenotype of the invention encompass methods with any further limitation described in this disclosure, or those following: Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66. Optionally, said control population is a trait negative population, or a random population. Optionally, each of said genotyping steps a) and b) is performed on a single pooled biological sample derived from each of said populations. Optionally, each of said genotyping steps a) and b) is performed separately on biological samples derived from each individual in said population or a subsample thereof. Optionally, said phenotype is a disease, cancer or prostate cancer; a response to an anti-cancer agent or an anti-prostate cancer agent; or a side effect to an anti-cancer or anti-prostate cancer agent. Optionally, said method comprises the additional steps of determining the phenotype in said trait positive and said control populations prior to step c).
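Step c) above does not prescribe a particular statistical test. As a non-limiting sketch, one common choice is a Pearson chi-square test with one degree of freedom on the 2×2 table of allele counts in the trait positive and control populations, together with an odds ratio. The function name `allele_association` and the allele counts below are hypothetical.

```python
import math


def allele_association(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele table.

    case_counts, control_counts: (count of allele 1, count of allele 2)
    in the trait positive and control populations respectively.
    Returns (chi_square, p_value, odds_ratio).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return chi2, p_value, odds_ratio


# Hypothetical allele counts: 100 case and 100 control chromosomes.
chi2, p_value, odds_ratio = allele_association((60, 40), (40, 60))
```

No continuity correction is applied in this sketch; whether one is appropriate depends on the sample sizes involved.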

[0052] An additional embodiment of the present invention encompasses methods of estimating the frequency of a haplotype for a set of biallelic markers in a population, comprising the steps of: a) genotyping at least one PG1-related biallelic marker for both copies of said set of biallelic markers present in the genome of each individual in said population or a subsample thereof, according to a genotyping method of the invention; b) genotyping a second biallelic marker by determining the identity of the allele at said second biallelic marker for both copies of said second biallelic marker present in the genome of each individual in said population or said subsample, according to a genotyping method of the invention; and c) applying a haplotype determination method to the identities of the nucleotides determined in steps a) and b) to obtain an estimate of said frequency. In addition, the methods of estimating the frequency of a haplotype of the invention encompass methods with any further limitation described in this disclosure, or those following: Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66. Optionally, said second biallelic marker is a PG1-related biallelic marker; a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66. Optionally, said PG1-related biallelic marker and said second biallelic marker are 4-77/151 and 4-66/145. Optionally, said haplotype determination method is an expectation-maximization algorithm.
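As a minimal sketch of the expectation-maximization option mentioned above, the following example estimates the frequencies of the four possible haplotypes of two biallelic markers from unphased genotypes. It is an illustration under stated assumptions, not the specific implementation of the invention: each genotype is assumed to be coded as the number of copies (0, 1 or 2) of an arbitrarily designated allele "1" at each marker, and the function name `em_haplotype_frequencies` and the input data are hypothetical.

```python
def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-10):
    """Estimate two-marker haplotype frequencies by expectation-maximization.

    genotypes: list of (g1, g2); g1 and g2 are the number of copies
    (0, 1 or 2) of allele '1' at marker 1 and marker 2 in one individual.
    Returns a dict mapping each haplotype (allele at marker 1,
    allele at marker 2) to its estimated frequency.
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}  # start from equal frequencies
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote has ambiguous phase,
                # either (0,0)+(1,1) or (0,1)+(1,0).
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1.0 - w
                counts[(1, 0)] += 1.0 - w
            else:
                # At most one heterozygous marker: phase is unambiguous.
                a1, a2 = g1 // 2, g2 // 2
                b1, b2 = g1 - a1, g2 - a2
                counts[(a1, a2)] += 1.0
                counts[(b1, b2)] += 1.0
        # M-step: re-estimate frequencies from expected counts.
        new_freq = {h: counts[h] / n_chrom for h in haps}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq


# Hypothetical data: two double homozygotes and one double heterozygote.
estimates = em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)])
```

Only double heterozygotes carry phase ambiguity with two markers, so the expectation step touches those individuals alone. Extending the approach to more markers requires enumerating every haplotype pair compatible with each genotype.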

[0053] An additional embodiment of the present invention encompasses methods of detecting an association between a haplotype and a phenotype, comprising the steps of: a) estimating the frequency of at least one haplotype in a trait positive population, according to a method of the invention for estimating the frequency of a haplotype; b) estimating the frequency of said haplotype in a control population, according to a method of the invention for estimating the frequency of a haplotype; and c) determining whether a statistically significant association exists between said haplotype and said phenotype. In addition, the methods of detecting an association between a haplotype and a phenotype of the invention encompass methods with any further limitation described in this disclosure, or those following: Optionally, said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66. Optionally, said PG1-related biallelic marker and said second biallelic marker are 4-77/151 and 4-66/145. Optionally, said haplotype exhibits a p-value of <1×10⁻³ in an association with a trait positive population with cancer, preferably prostate cancer. Optionally, said control population is a trait negative population, or a random population. Optionally, said phenotype is a disease, cancer or prostate cancer; a response to an anti-cancer agent or an anti-prostate cancer agent; or a side effect to an anti-cancer or anti-prostate cancer agent. Optionally, said method comprises the additional steps of determining the phenotype in said trait positive and said control populations prior to step c).

[0054] Additional embodiments and aspects of the present invention are set forth in the Detailed Description of the Invention and the Examples.

BRIEF DESCRIPTION OF THE DRAWINGS

[0055] FIG. 1 is a diagram showing the BAC contig containing the PG1 gene and the positions of biallelic markers along the contig.

[0056] FIG. 2 is a graph showing the results of the first screening of a prostate cancer association study and the significance of various biallelic markers as measured by their chi squared and p-values for a low density set of markers.

[0057] FIG. 3 is a graph showing the results of the first screening of a prostate cancer association study and the significance of various biallelic markers as measured by their chi squared and p-values for a higher density set of markers.

[0058] FIG. 4 is a table demonstrating the results of a haplotype analysis. Among all the theoretical potential different haplotypes based on 2 to 9 markers, 11 haplotypes showing a strong association with prostate cancer were selected, and their haplotype analysis results are shown here.

[0059] FIG. 5 is a bar graph demonstrating the results of an experiment evaluating the significance (p-values) of the haplotype analysis shown in FIG. 4.

[0060] FIG. 6A is a table listing the biallelic markers used in the haplotype analysis of FIG. 4. FIG. 6B is a table listing additional biallelic markers in linkage disequilibrium with the PG1 gene.

[0061] FIG. 7 is a table listing the positions of exons, splice sites, a stop codon, and a poly A site in the PG1 gene.

[0062] FIG. 8A is a diagram showing the genomic structure of PG1 in comparison with its most abundant mRNA transcript. FIG. 8B is a more detailed diagram showing the genomic structure of PG1, including exons and introns.

[0063] FIG. 9 is a table listing some of the homologies between the PG1 protein and known proteins.

[0064] FIG. 10 is a half-tone reproduction of a fluorescence micrograph of the perinuclear/nuclear expression of PG1 in tumoral (PC3) and normal prostatic cell lines (PNT2). Vector “PG1” includes all the coding exons from exon 1 to 8. For PC3 (upper panel) and PNT2 (lower panel), the nucleus was labelled with Propidium iodide (IP, left panel). Note that EGFP fluorescence was detected in and around the nucleus (GFP, middle panel), as shown when the two pictures were overlapped (right panel).

[0065] FIG. 11 is a half-tone reproduction of a fluorescence micrograph of the perinuclear/nuclear expression of PG1/1-4 in tumoral (PC3) and normal prostatic cell lines (PNT2). Vector “PG1/1-4” corresponds to an alternative messenger which is due to an alternative splicing, joining exon 1 to exon 4, and resulting in the absence of exons 2 and 3. For PC3 (upper panel) and PNT2 (lower panel), the nucleus was labelled with Propidium iodide (IP, left panel). Note that EGFP fluorescence was detected in and around the nucleus (GFP, middle panel), as shown when the two pictures were overlapped (right panel).

[0066] FIG. 12 is a half-tone reproduction of a fluorescence micrograph of the perinuclear/nuclear expression of PG1/1-5 in tumoral prostatic cell line (PC3) and cytoplasmic expression of PG1/1-5 in normal prostatic cell line (PNT2). Vector “PG1/1-5” corresponds to an alternative messenger which is due to an alternative splicing, joining exon 1 to exon 5, and resulting in the absence of exons 2, 3 and 4. For PC3 (upper panel) and PNT2 (lower panels), the nucleus was labelled with Propidium iodide (IP). Note that in PC3 cells, EGFP fluorescence was detected in and around the nucleus (GFP, upper middle panel), as shown when the two pictures were overlapped (upper right panel). In PNT2A cells, EGFP fluorescence was detected in the cytoplasm (GFP, lower left panel), as shown when the two pictures were overlapped (lower right panel).

[0067] FIG. 13 is a half-tone reproduction of a fluorescence micrograph of the perinuclear/nuclear expression of a mutated form of PG1 (PG1mut229) in normal prostatic cell line (PNT2). Vector “PG1/1-7” includes exons 1 to 6, and corresponds to the mutated form identified in genomic DNA of the prostatic tumoural cell line LNCaP. The nucleus was labelled with Propidium iodide (IP, left panel). EGFP fluorescence was detected in the cytoplasm (GFP, middle panel), as shown when the two pictures were overlapped (lower right panel).

[0068] FIG. 14 is a diagram of the structure of the 14 alternative splice species found for human PG1 by the exons present. An * indicates that there is a stop codon in frame at that location. An arrow to the right at the right-hand side of a splice species indicates that the open-reading frame continues off of the chart. A space between exons indicates that the exon(s) is missing from that particular alternative splice species. An up arrow indicates that either exon 1bis, 3bis, or 5bis has been inserted depending upon which is indicated. A bracket notation in exon 6, over an exon 6bis notation indicates that the first 60 bases are missing from exon 6, and exon 6bis is therefore present as a truncated form of exon 6.

[0069] FIG. 15 is a table listing the results of a series of RT-PCR experiments that were performed on RNA of normal prostate, normal prostatic cell lines (PNT1A, PNT1B and PNT2), tumoral prostatic cell lines (LnCaPFCG, LnCaPJMB, CaHPV, Du145, and PC3), and prostate tumors (ECP5 to ECP24) using all the possible combinations of primers (SEQ ID NOs: 137-178) specific to all of the possible splice junctions or exon borders in human PG1. An NT indicates that the experiment was not performed. A [+] indicates the use of an alternative splice species with exons 1, 3, 4, 7, and 8.

[0070] FIG. 16 is a graph showing the results of association studies using markers spanning the 650 kb region of the 8p23 locus around PG1, using both single point analysis and haplotyping studies.

[0071] FIG. 17 is a graph showing an enlarged view of the single point association results within a 160 kb region comprising the PG1 gene.

[0072] FIG. 18A is a graph showing an enlarged view of the single point association results of 40 kb within the PG1 gene. FIG. 18B is a table listing the location of markers within the PG1 gene and the two possible alleles at each site. For each marker, the disease-associated allele is indicated first; its frequencies in cases and controls as well as the difference between both are shown; the odds ratio and the p-value of each individual marker association are also shown.

[0073] FIG. 19A is a table showing the results of a haplotype analysis study using 4 markers (marker Nos. 4-14, 99-217, 4-66 and 99-221) within the 160 kb region shown in FIG. 17. FIG. 19B is a table showing the segmented haplotyping results according to the subject's age, and whether the prostate cancer cases were sporadic or familial, using the same 4 markers and the same individuals as were used to generate the results in FIG. 19A.

[0074] FIG. 20 is a table listing the haplotyping results and odds ratios for combinations of the 7 markers (99-622; 4-77; 4-71; 4-73; 99-598; 99-576; 4-66) within the PG1 gene that are shown in FIG. 18 to have p-values more significant than 1×10⁻². All of the 2-, 3-, 4-, 5-, 6- and 7-marker haplotypes were tested.

[0075] FIG. 21 is a graph showing the distribution of statistical significance, as measured by Chi-square values, for each series of possible x-marker haplotypes (x=2, 3 or 4) using all of the 19 markers listed in FIG. 18B.

[0076] FIG. 22 is a block diagram of an exemplary computer system.

[0077] FIG. 23 is a flow diagram illustrating one embodiment of a process 200 for comparing a new nucleotide or protein sequence with a database of sequences in order to determine the homology levels between the new sequence and the sequences in the database.

[0078] FIG. 24 is a flow diagram illustrating one embodiment of a process 250 in a computer for determining whether two sequences are homologous.

[0079] FIG. 25 is a flow diagram illustrating one embodiment of an identifier process 300 for detecting the presence of a feature in a sequence.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0080] The practice of the present invention encompasses conventional techniques of chemistry, immunology, molecular biology, biochemistry, protein chemistry, and recombinant DNA technology, which are within the skill of the art. Such techniques are explained fully in the literature. See e.g., Oligonucleotide Synthesis (M. Gait ed. 1984); Nucleic Acid Hybridization (B. Hames & S. Higgins, eds., 1984); Sambrook, Fritsch & Maniatis, Molecular Cloning: A Laboratory Manual, Second Edition (1989); PCR Technology (H. A. Erlich ed., Stockton Press); R. Scopes, Protein Purification Principles and Practice (Springer-Verlag); and the series Methods in Enzymology (S. Colowick and N. Kaplan eds., Academic Press, Inc.).

[0081] Definitions

[0082] As used interchangeably herein, the terms “nucleic acid”, “oligonucleotide”, and “polynucleotides” include RNA, DNA, or RNA/DNA hybrid sequences of more than one nucleotide in either single chain or duplex form. The term “nucleotide” is used herein as an adjective to describe molecules comprising RNA, DNA, or RNA/DNA hybrid sequences of any length in single-stranded or duplex form. The term “nucleotide” is also used herein as a noun to refer to individual nucleotides or varieties of nucleotides, meaning a molecule, or individual unit in a larger nucleic acid molecule, comprising a purine or pyrimidine, a ribose or deoxyribose sugar moiety, and a phosphate group, or phosphodiester linkage in the case of nucleotides within an oligonucleotide or polynucleotide. The term “nucleotide” is also used herein to encompass “modified nucleotides” which comprise at least one of the following modifications: (a) an alternative linking group, (b) an analogous form of purine, (c) an analogous form of pyrimidine, or (d) an analogous sugar; for examples of analogous linking groups, purines, pyrimidines, and sugars see for example PCT publication No. WO 95/04064. However, the polynucleotides of the invention are preferably comprised of greater than 50% conventional deoxyribose nucleotides, and most preferably greater than 90% conventional deoxyribose nucleotides. The polynucleotide sequences of the invention may be prepared by any known method, including synthetic, recombinant, ex vivo generation, or a combination thereof, as well as utilizing any purification methods known in the art.

[0083] As used herein, the term “purified” does not require absolute purity; rather, it is intended as a relative definition. Purification of starting material or natural material to at least one order of magnitude, preferably two or three orders, and more preferably four or five orders of magnitude is expressly contemplated.

[0084] The term “purified” is used herein to describe a polynucleotide or polynucleotide vector of the invention which has been separated from other compounds including, but not limited to other nucleic acids, carbohydrates, lipids and proteins (such as the enzymes used in the synthesis of the polynucleotide), or the separation of covalently closed polynucleotides from linear polynucleotides. A polynucleotide is substantially pure when at least about 50%, preferably 60 to 75% of a sample exhibits a single polynucleotide sequence and conformation (linear versus covalently closed). A substantially pure polynucleotide typically comprises about 50%, preferably 60 to 90% weight/weight of a nucleic acid sample, more usually about 95%, and preferably is over about 99% pure. Polynucleotide purity or homogeneity is indicated by a number of means well known in the art, such as agarose or polyacrylamide gel electrophoresis of a sample, followed by visualizing a single polynucleotide band upon staining the gel. For certain purposes higher resolution can be provided by using HPLC or other means well known in the art.

[0085] The term “polypeptide” refers to a polymer of amino acids without regard to the length of the polymer; thus, peptides, oligopeptides, and proteins are included within the definition of polypeptide. This term also does not specify or exclude post-expression modifications of polypeptides; for example, polypeptides which include the covalent attachment of glycosyl groups, acetyl groups, phosphate groups, lipid groups and the like are expressly encompassed by the term polypeptide. Also included within the definition are polypeptides which contain one or more analogs of an amino acid (including, for example, non-naturally occurring amino acids, amino acids which only occur naturally in an unrelated biological system, modified amino acids from mammalian systems etc.), polypeptides with substituted linkages, as well as other modifications known in the art, both naturally occurring and non-naturally occurring.

[0086] As used herein, the term “isolated” requires that the material be removed from its original environment (e.g., the natural environment if it is naturally occurring).

[0087] The term “purified” is used herein to describe a polypeptide of the invention which has been separated from other compounds including, but not limited to nucleic acids, lipids, carbohydrates and other proteins. A polypeptide is substantially pure when at least about 50%, preferably 60 to 75% of a sample exhibits a single polypeptide sequence. A substantially pure polypeptide typically comprises about 50%, preferably 60 to 90% weight/weight of a protein sample, more usually about 95%, and preferably is over about 99% pure. Polypeptide purity or homogeneity is indicated by a number of means well known in the art, such as agarose or polyacrylamide gel electrophoresis of a sample, followed by visualizing a single polypeptide band upon staining the gel. For certain purposes higher resolution can be provided by using HPLC or other means well known in the art.

[0088] As used herein, the term “non-human animal” refers to any non-human vertebrate, birds and more usually mammals, preferably primates, farm animals such as swine, goats, sheep, donkeys, and horses, rabbits or rodents, more preferably rats or mice. As used herein, the term “animal” is used to refer to any vertebrate, preferably a mammal. Both the terms “animal” and “mammal” expressly embrace human subjects unless preceded with the term “non-human”.

[0089] As used herein, the term “antibody” refers to a polypeptide or group of polypeptides which are comprised of at least one binding domain, where an antibody binding domain is formed from the folding of variable domains of an antibody molecule to form three-dimensional binding spaces with an internal surface shape and charge distribution complementary to the features of an antigenic determinant of an antigen, which allows an immunological reaction with the antigen. Antibodies include recombinant proteins comprising the binding domains, as well as fragments, including Fab, Fab′, F(ab)₂, and F(ab′)₂ fragments.

[0090] As used herein, an “antigenic determinant” is the portion of an antigen molecule, in this case a PG1 polypeptide, that determines the specificity of the antigen-antibody reaction. An “epitope” refers to an antigenic determinant of a polypeptide. An epitope can comprise as few as 3 amino acids in a spatial conformation which is unique to the epitope. Generally an epitope consists of at least 6 such amino acids, and more usually at least 8-10 such amino acids. Methods for determining the amino acids which make up an epitope include x-ray crystallography, 2-dimensional nuclear magnetic resonance, and epitope mapping e.g. the Pepscan method described by H. Mario Geysen et al. 1984. Proc. Natl. Acad. Sci. U.S.A. 81:3998-4002; PCT Publication No. WO 84/03564; and PCT Publication No. WO 84/03506.

[0091] The terms “DNA construct” and “vector” are used herein to mean a purified or isolated polynucleotide that has been artificially designed and which comprises at least two nucleotide sequences that are not found as contiguous nucleotide sequences in their natural environment.

[0092] The terms “trait” and “phenotype” are used interchangeably herein and refer to any visible, detectable or otherwise measurable property of an organism such as symptoms of, or susceptibility to a disease for example. Typically the terms “trait” or “phenotype” are used herein to refer to symptoms of, or susceptibility to cancer or prostate cancer; or to refer to an individual's response to an anti-cancer agent or an anti-prostate cancer agent; or to refer to symptoms of, or susceptibility to side effects to an anti-cancer agent or an anti-prostate cancer agent.

[0093] The term “allele” is used herein to refer to variants of a nucleotide sequence. A biallelic polymorphism has two forms. Typically the first identified allele is designated as the original allele whereas other alleles are designated as alternative alleles. Diploid organisms may be homozygous or heterozygous for an allelic form.

[0094] The term “heterozygosity rate” is used herein to refer to the incidence of individuals in a population, which are heterozygous at a particular allele. In a biallelic system the heterozygosity rate is on average equal to 2P_(a)(1-P_(a)), where P_(a) is the frequency of the least common allele. In order to be useful in genetic studies a genetic marker should have an adequate level of heterozygosity to allow a reasonable probability that a randomly selected person will be heterozygous.
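The formula above can be evaluated directly, as in the following brief sketch. The function name `heterozygosity_rate` is hypothetical and is introduced for illustration only.

```python
def heterozygosity_rate(p_a):
    """Expected heterozygosity 2 * P_a * (1 - P_a) for a biallelic marker.

    p_a: frequency of the least common allele, between 0 and 1.
    """
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 2.0 * p_a * (1.0 - p_a)


# Rates for least-common-allele frequencies of 20%, 30% and 50%.
rates = [round(heterozygosity_rate(p), 2) for p in (0.2, 0.3, 0.5)]
```

For frequencies of 0.2 and 0.3 the formula gives 0.32 and 0.42 respectively, and the rate reaches its maximum of 0.5 when both alleles are equally common.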

[0095] The term “genotype” as used herein refers to the identity of the alleles present in an individual or a sample. In the context of the present invention a genotype preferably refers to the description of the biallelic marker alleles present in an individual or a sample. The term “genotyping” a sample or an individual for a biallelic marker consists of determining the specific allele or the specific nucleotide carried by an individual at a biallelic marker.

[0096] The term “mutation” as used herein refers to a difference in DNA sequence between or among different genomes or individuals which has a frequency below 1%.

[0097] The term “haplotype” refers to a combination of alleles present in an individual or a sample. In the context of the present invention a haplotype preferably refers to a combination of biallelic marker alleles found in a given individual and which is associated with a phenotype.

[0098] The term “polymorphism” as used herein refers to the occurrence of two or more alternative genomic sequences or alleles between or among different genomes or individuals. “Polymorphic” refers to the condition in which two or more variants of a specific genomic sequence can be found in a population. A “polymorphic site” is the locus at which the variation occurs. A single nucleotide polymorphism is a single base pair change. Typically a single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at the polymorphic site. Deletion of a single nucleotide or insertion of a single nucleotide also gives rise to single nucleotide polymorphisms. In the context of the present invention “single nucleotide polymorphism” preferably refers to a single nucleotide substitution. Typically, between different genomes or between different individuals, the polymorphic site is occupied by two different nucleotides.

[0099] The terms “biallelic polymorphism” and “biallelic marker” are used interchangeably herein to refer to a nucleotide polymorphism having two alleles at a fairly high frequency in the population. A “biallelic marker allele” refers to the nucleotide variants present at a biallelic marker site. Usually a biallelic marker is a single nucleotide polymorphism. However, less commonly there are also insertions and deletions of up to 5 nucleotides which constitute biallelic markers for the purposes of the present invention. Typically the frequency of the less common allele of the biallelic markers of the present invention has been validated to be greater than 1%, preferably the frequency is greater than 10%, more preferably the frequency is at least 20% (i.e. heterozygosity rate of at least 0.32), even more preferably the frequency is at least 30% (i.e. heterozygosity rate of at least 0.42). A biallelic marker wherein the frequency of the less common allele is 30% or more is termed a “high quality biallelic marker.”

[0100] The location of nucleotides in a polynucleotide with respect to the center of the polynucleotide is described herein in the following manner. When a polynucleotide has an odd number of nucleotides, the nucleotide at an equal distance from the 3′ and 5′ ends of the polynucleotide is considered to be “at the center” of the polynucleotide, and any nucleotide immediately adjacent to the nucleotide at the center, or the nucleotide at the center itself is considered to be “within 1 nucleotide of the center.” With an odd number of nucleotides in a polynucleotide any of the five nucleotide positions in the middle of the polynucleotide would be considered to be within 2 nucleotides of the center, and so on. When a polynucleotide has an even number of nucleotides, there would be a bond and not a nucleotide at the center of the polynucleotide. Thus, either of the two central nucleotides would be considered to be “within 1 nucleotide of the center” and any of the four nucleotides in the middle of the polynucleotide would be considered to be “within 2 nucleotides of the center”, and so on.
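The positional convention above may be expressed as a short sketch. Positions are assumed to be numbered from 1 starting at the 5′ end, and the function name `positions_within_center` is hypothetical.

```python
def positions_within_center(length, k):
    """Return the 1-based positions lying within k nucleotides of the center.

    length: number of nucleotides in the polynucleotide.
    k: distance from the center, at least 1.
    """
    if length < 1 or k < 1:
        raise ValueError("length and k must both be at least 1")
    if length % 2 == 1:
        # Odd length: a single nucleotide sits at the center.
        center = (length + 1) // 2
        low, high = center - k, center + k
    else:
        # Even length: the center is the bond between the two
        # middle nucleotides, so k nucleotides lie on each side.
        low, high = length // 2 - k + 1, length // 2 + k
    return list(range(max(1, low), min(length, high) + 1))


odd_one = positions_within_center(47, 1)
odd_two = positions_within_center(47, 2)
even_one = positions_within_center(8, 1)
even_two = positions_within_center(8, 2)
```

For a 47-mer the central nucleotide is position 24, so positions 23 to 25 are within 1 nucleotide of the center and positions 22 to 26 are within 2. For an 8-mer, positions 4 and 5 are within 1 nucleotide of the center and positions 3 to 6 are within 2.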

[0101] The term “upstream” is used herein to refer to a location which is toward the 5′ end of the polynucleotide from a specific reference point.

[0102] The terms “base paired” and “Watson & Crick base paired” are used interchangeably herein to refer to nucleotides which can be hydrogen bonded to one another by virtue of their sequence identities in a manner like that found in double-helical DNA with thymine or uracil residues linked to adenine residues by two hydrogen bonds and cytosine and guanine residues linked by three hydrogen bonds (See Stryer, L., Biochemistry, 4^(th) edition, 1995). The terms “complementary” or “complement thereof” are used herein to refer to the sequences of polynucleotides which are capable of forming Watson & Crick base pairing with another specified polynucleotide throughout the entirety of the complementary region. This term is applied to pairs of polynucleotides based solely upon their sequences and not any particular set of conditions under which the two polynucleotides would actually bind.
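Because complementarity as defined above depends solely on sequence, it can be checked computationally. The following sketch assumes both sequences are written 5′ to 3′ and pair in antiparallel orientation along their entire length; the function name `is_complementary` and the example sequences are hypothetical.

```python
# Watson & Crick pairs: A with T or U, and G with C.
WATSON_CRICK = {
    ("A", "T"), ("T", "A"),
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
}


def is_complementary(strand, other):
    """Return True if `other` base pairs with `strand` over its full length.

    Both sequences are given 5' to 3'; pairing is antiparallel, so the
    first base of `strand` is compared with the last base of `other`.
    """
    strand, other = strand.upper(), other.upper()
    if not strand or len(strand) != len(other):
        return False
    return all(
        (x, y) in WATSON_CRICK for x, y in zip(strand, reversed(other))
    )


dna_pair = is_complementary("ATGC", "GCAT")
same_strand = is_complementary("ATGC", "ATGC")
rna_dna_pair = is_complementary("AUGC", "GCAT")
```

The check says nothing about whether two polynucleotides would hybridize under any given conditions, which is consistent with the definition being sequence-based only.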

[0103] As used herein the term “PG1-related biallelic marker” relates to a set of biallelic markers in linkage disequilibrium with PG1. The term PG1-related biallelic marker includes all of the biallelic markers used in the initial association studies shown below in Section I.D., including those biallelic markers contained in SEQ ID NOs: 21-38 and 57-62. The term PG1-related biallelic marker encompasses all of the following polymorphisms positioned in SEQ ID NO: 179, and listed by internal reference number, including: 5-63-169 G or C in position 2159; 5-63-453 C or T in position 2443; 99-622-95 T or C in position 4452; 99-621-215 T or C in position 5733; 99-619-141 G or A in position 8438; 4-76-222 deletion of GT in position 11843; 4-76-361 C or T in position 11983; 4-77-151 G or C in position 12080; 4-77-294 A or G in position 12221; 4-71-33 G or T in position 12947; 4-71-233 A or G in position 13147; 4-71-280 G or A in position 13194; 4-71-396 G or C in position 13310; 4-72-127 A or G in position 13342; 4-72-152 A or G in position 13367; 4-72-380 deletion of A in position 13594; 4-73-134 G or C in position 13680; 4-73-356 G or C in position 13902; 99-610-250 T or C in position 16231; 99-610-93 A or T in position 16388; 99-609-225 A or T in position 17608; 4-90-27 A or C in position 18034; 4-90-283 A or C in position 18290; 99-607-397 T or C in position 18786; 99-602-295 deletion of A in position 22835; 99-602-258 T or C in position 22872; 99-600-492 deletion of TATTG in position 25183; 99-600-483 T or G in position 25192; 5-23-288 A or G in position 25614; 99-598-130 T or C in position 26911; 99-592-139 A or T in position 32703; 99-217-277 C or T in position 34491; 5-47-284 A or G in position 34756; 99-589-267 T or G in position 34934; 99-589-41 G or C in position 35160; 99-12899-307 C or T in position 39897; 4-12-68 A or G in position 40598; 99-582-263 T or C in position 40816; 99-582-132 T or C in position 40947; 99-576-421 G or C in position 45783; 4-13-51 C or T in position 47929; 4-13-328 A or T in position 48206; 4-13-329 G or C in position 48207; 99-12903-381 C or T in position 49282; 5-56-208 A or G in position 50037; 5-56-225 A or G in position 50054; 5-56-272 A or G in position 50101; 5-56-391 G or T in position 50220; 4-61-269 A or G in position 50440; 4-61-391 A or G in position 50562; 4-63-99 A or G in position 50653; 4-62-120 A or G in position 50660; 4-62-205 A or G in position 50745; 4-64-113 A or T in position 50885; 4-65-104 A or G in position 51249; 5-28-300 A or G in position 51333; 5-50-269 C or T in position 51435; 4-65-324 C or T in position 51468; 5-71-129 G or C in position 51515; 5-50-391 G or C in position 51557; 5-71-180 A or G in position 51566; 4-67-40 C or T in position 51632; 5-71-280 A or C in position 51666; 5-58-167 A or G in position 52016; 5-30-325 C or T in position 52096; 5-58-302 A or T in position 52151; 5-31-178 A or G in position 52282; 5-31-244 A or G in position 52348; 5-31-306 deletion of A in position 52410; 5-32-190 C or T in position 52524; 5-32-246 C or T in position 52580; 5-32-378 deletion of A in position 52712; 5-53-266 G or C in position 52772; 5-60-158 C or T in position 52860; 5-60-390 A or G in position 53092; 5-68-272 G or C in position 53272; 5-68-385 A or T in position 53389; 5-66-53 deletion of GA in position 53511; 5-66-142 G or C in position 53600; 5-66-207 A or G in position 53665; 5-37-294 A or G in position 53815; 5-62-163 insertion of A in position 54365; 5-62-340 A or T in position 54541; and the complements thereof. The term PG1-related biallelic marker also includes all of the following biallelic markers listed by internal reference number, and two SEQ ID NOs each of which contains a 47-mer with one of the two alternative bases at position 24:

[0104] 4-14-107 of SEQ ID NOs 185 and 262; 4-14-317 of SEQ ID NOs 186 and 263; 4-14-35

[0105] of SEQ ID NOs 187 and 264; 4-20-149 of SEQ ID NOs 188 and 265;

[0106] 4-20-77 of SEQ ID NOs 189 and 266; 4-22-174 of SEQ ID NOs 190 and 267;

[0107] 4-22-176 of SEQ ID NOs 191 and 268; 4-26-60 of SEQ ID NOs 192 and 269;

[0108] 4-26-72 of SEQ ID NOs 193 and 270; 4-3-130 of SEQ ID NOs 194 and 271;

[0109] 4-38-63 of SEQ ID NOs 195 and 272;

[0110] 4-38-83 of SEQ ID NOs 196 and 273; 4-4-152 of SEQ ID NOs 197 and 274;

[0111] 4-4-187 of SEQ ID NOs 198 and 275; 4-4-288 of SEQ ID NOs 199 and 276;

[0112] 4-42-304 of SEQ ID NOs 200 and 277; 4-42-401 of SEQ ID NOs 201 and 278;

[0113] 4-43-328 of SEQ ID NOs 202 and 279; 4-43-70 of SEQ ID NOs 203 and 280;

[0114] 4-50-209 of SEQ ID NOs 204 and 281; 4-50-293 of SEQ ID NOs 205 and 282;

[0115] 4-50-323 of SEQ ID NOs 206 and 283; 4-50-329 of SEQ ID NOs 207 and 284;

[0116] 4-50-330 of SEQ ID NOs 208 and 285; 4-52-163 of SEQ ID NOs 209 and 286;

[0117] 4-52-88 of SEQ ID NOs 210 and 287; 4-53-258 of SEQ ID NOs 211 and 288;

[0118] 4-54-283 of SEQ ID NOs 212 and 289; 4-54-388 of SEQ ID NOs 213 and 290;

[0119] 4-55-70 of SEQ ID NOs 214 and 291; 4-55-95 of SEQ ID NOs 215 and 292;

[0120] 4-56-159 of SEQ ID NOs 216 and 293; 4-56-213 of SEQ ID NOs 217 and 294;

[0121] 4-58-289 of SEQ ID NOs 218 and 295; 4-58-318 of SEQ ID NOs 219 and 296;

[0122] 4-60-266 of SEQ ID NOs 220 and 297; 4-60-293 of SEQ ID NOs 221 and 298;

[0123] 4-84-241 of SEQ ID NOs 222 and 299; 4-84-262 of SEQ ID NOs 223 and 300;

[0124] 4-86-206 of SEQ ID NOs 224 and 301; 4-86-309 of SEQ ID NOs 225 and 302;

[0125] 4-88-349 of SEQ ID NOs 226 and 303; 4-89-87 of SEQ ID NOs 227 and 304;

[0126] 99-123-184 of SEQ ID NOs 228 and 305; 99-128-202 of SEQ ID NOs 229 and 306;

[0127] 99-128-275 of SEQ ID NOs 230 and 307; 99-128-313 of SEQ ID NOs 231 and 308;

[0128] 99-128-60 of SEQ ID NOs 232 and 309; 99-12907-295 of SEQ ID NOs 233 and 310;

[0129] 99-130-58 of SEQ ID NOs 234 and 311; 99-134-362 of SEQ ID NOs 235 and 312;

[0130] 99-140-130 of SEQ ID NOs 236 and 313; 99-1462-238 of SEQ ID NOs 237 and 314; 99-147-181 of SEQ ID NOs 238 and 315; 99-1474-156 of SEQ ID NOs 239 and 316; 99-1474-359 of SEQ ID NOs 240 and 317; 99-1479-158 of SEQ ID NOs 241 and 318; 99-1479-379 of SEQ ID NOs 242 and 319; 99-148-129 of SEQ ID NOs 243 and 320; 99-148-132 of SEQ ID NOs 244 and 321; 99-148-139 of SEQ ID NOs 245 and 322; 99-148-140 of SEQ ID NOs 246 and 323; 99-148-182 of SEQ ID NOs 247 and 324; 99-148-366 of SEQ ID NOs 248 and 325; 99-148-76 of SEQ ID NOs 249 and 326; 99-1480-290 of SEQ ID NOs 250 and 327; 99-1481-285 of SEQ ID NOs 251 and 328; 99-1484-101 of SEQ ID NOs 252 and 329; 99-1484-328 of SEQ ID NOs 253 and 330; 99-1485-2514 of SEQ ID NOs 254 and 331; 99-1490-381 of SEQ ID NOs 255 and 332; 99-1493-280 of SEQ ID NOs 256 and 333; 99-151-94 of SEQ ID NOs 257 and 334; 99-211-291 of SEQ ID NOs 258 and 335; 99-213-37 of SEQ ID NOs 259 and 336; 99-221-442 of SEQ ID NOs 260 and 337; 99-222-109 of SEQ ID NOs 261 and 338; and the complements thereof.

[0131] The term “non-genic” is used herein to describe PG1-related biallelic markers, as well as polynucleotides and primers, which do not occur in the human PG1 genomic sequence of SEQ ID NO: 179. The term “genic” is used herein to describe PG1-related biallelic markers, as well as polynucleotides and primers, which do occur in the human PG1 genomic sequence of SEQ ID NO: 179.

[0132] The term “an anti-cancer agent” refers to a drug or a compound that is capable of reducing the growth rate, rate of metastasis, or viability of tumor cells in a mammal, is capable of reducing the size of or eliminating tumors in a mammal, or is capable of increasing the average life span of a mammal or human with cancer. Anti-cancer agents also include compounds which are able to reduce the risk of cancer developing in a population, particularly a high risk population. The term “an anti-prostate cancer agent” refers to an anti-cancer agent that has these effects on cells or tumors that are derived from prostate cancer cells.

[0133] The terms “response to an anti-cancer agent” and “response to an anti-prostate cancer agent” refer to drug efficacy, including but not limited to the ability to metabolize a compound, the ability to convert a pro-drug to an active drug, and the pharmacokinetics (absorption, distribution, elimination) and the pharmacodynamics (receptor-related) of a drug in an individual.

[0134] The terms “side effects to an anti-cancer agent” and “side effects to an anti-prostate cancer agent” refer to adverse effects of therapy resulting from extensions of the principal pharmacological action of the drug or to idiosyncratic adverse reactions resulting from an interaction of the drug with unique host factors. These side effects include, but are not limited to, adverse reactions such as dermatological, hematological or hepatological toxicities and further include gastric and intestinal ulceration, disturbance in platelet function, renal injury, nephritis, vasomotor rhinitis with profuse watery secretions, angioneurotic edema, generalized urticaria, and bronchial asthma to laryngeal edema and bronchoconstriction, hypotension, sexual dysfunction, and shock.

[0135] As used herein the term “homology” refers to comparisons between protein and/or nucleic acid sequences and is evaluated using any of the variety of sequence comparison algorithms and programs known in the art. Such algorithms and programs include, but are by no means limited to, TBLASTN, BLASTP, FASTA, TFASTA, and CLUSTALW (Pearson and Lipman, 1988, Proc. Natl. Acad. Sci. USA 85(8):2444-2448; Altschul et al., 1990, J. Mol. Biol. 215(3):403-410; Thompson et al., 1994, Nucleic Acids Res. 22(2):4673-4680; Higgins et al., 1996, Methods Enzymol. 266:383-402; Altschul et al., 1993, Nature Genetics 3:266-272). In a particularly preferred embodiment, protein and nucleic acid sequence homologies are evaluated using the Basic Local Alignment Search Tool (“BLAST”) which is well known in the art (see, e.g., Karlin and Altschul, 1990, Proc. Natl. Acad. Sci. USA 87:2267-2268; Altschul et al., 1990, J. Mol. Biol. 215:403-410; Altschul et al., 1993, Nature Genetics 3:266-272; Altschul et al., 1997, Nuc. Acids Res. 25:3389-3402). In particular, five specific BLAST programs are used to perform the following tasks:

[0136] (1) BLASTP and BLAST3 compare an amino acid query sequence against a protein sequence database;

[0137] (2) BLASTN compares a nucleotide query sequence against a nucleotide sequence database;

[0138] (3) BLASTX compares the six-frame conceptual translation products of a query nucleotide sequence (both strands) against a protein sequence database;

[0139] (4) TBLASTN compares a query protein sequence against a nucleotide sequence database translated in all six reading frames (both strands); and

[0140] (5) TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
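The “six-frame conceptual translation” used by BLASTX, TBLASTN and TBLASTX can be illustrated with a minimal Python sketch. It is not the BLAST implementation; the function names are hypothetical, and it assumes the standard genetic code and an input containing only A, C, G and T.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard nuclear code applies to the sequence being translated).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations of the three forward and three reverse frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


print(six_frames("ATGGCC"))
```

Each of the six protein strings would then be compared against the database in the same way as an ordinary protein query.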

[0141] The BLAST programs identify homologous sequences by identifying similar segments, which are referred to herein as “high-scoring segment pairs,” between a query amino or nucleic acid sequence and a test sequence which is preferably obtained from a protein or nucleic acid sequence database. High-scoring segment pairs are preferably identified (i.e., aligned) by means of a scoring matrix, many of which are known in the art. Preferably, the scoring matrix used is the BLOSUM62 matrix (Gonnet et al., 1992, Science 256:1443-1445; Henikoff and Henikoff, 1993, Proteins 17:49-61). Less preferably, the PAM or PAM250 matrices may also be used (see, e.g., Schwartz and Dayhoff, eds., 1978, Matrices for Detecting Distance Relationships: Atlas of Protein Sequence and Structure, Washington: National Biomedical Research Foundation). The BLAST programs evaluate the statistical significance of all high-scoring segment pairs identified, and preferably select those segments which satisfy a user-specified threshold of significance, such as a user-specified percent homology. Preferably, the statistical significance of a high-scoring segment pair is evaluated using the statistical significance formula of Karlin (see, e.g., Karlin and Altschul, 1990, Proc. Natl. Acad. Sci. USA 87:2267-2268).

I. ISOLATION AND CHARACTERIZATION OF THE PG1 GENE AND PROTEINS

I.A. The 8p23 Region

LOH Studies: Implications of 8p23 Region in Distinct Cancer Types

[0142] Substantial amounts of LOH data support the hypothesis that genes associated with distinct cancer types are located within the 8p23 region of the human genome. Emi et al. demonstrated the implication of the 8p23.1-8p21.3 region in cases of hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer. (Emi M, Fujiwara Y, Nakajima T, Tsuchiya E, Tsuda H, Hirohashi S, Maeda Y, Tsuruta K, Miyaki M, Nakamura Y, Cancer Res. Oct. 1, 1992; 52(19): 5368-5372). Yaremko et al. showed the existence of two major regions of LOH for chromosome 8 markers in a sample of 87 colorectal carcinomas. The most prominent loss was found for 8p23.1-pter, where 45% of informative cases demonstrated loss of alleles. (Yaremko M L, Wasylyshyn M L, Paulus K L, Michelassi F, Westbrook C A, Genes Chromosomes Cancer May 1994; 10(1):1-6). Scholnick et al. demonstrated the existence of three distinct regions of LOH for the markers of chromosome 8 in cases of squamous cell carcinoma of the supraglottic larynx. They showed that the allelic loss of 8p23 marker D8S264 serves as a statistically significant, independent predictor of poor prognosis for patients with supraglottic squamous cell carcinoma. (Scholnick S B, Haughey B H, Sunwoo J B, el-Mofty S K, Baty J D, Piccirillo J F, Zequeira M R, J. Natl. Cancer Inst. Nov. 20, 1996; 88(22): 1676-1682 and Sunwoo J B, Holt M S, Radford D M, Deeker C, Scholnick S, Genes Chromosomes Cancer July 1996; 16(3):164-169).

[0143] In other studies, Nagai et al. demonstrated the highest loss of heterozygosity in the specific region of 8p23 by genome wide scanning of LOH in 120 cases of hepatocellular carcinoma (HCC). (Nagai H, Pineau P, Tiollais P, Buendia M A, Dejean A, Oncogene Jun. 19, 1997; 14(24): 2927-2933). Gronwald et al. demonstrated 8p23-pter loss in renal clear cell carcinomas. (Gronwald J, Storkel S, Holtgreve-Grez H, Hadaczek P, Brinkschmidt C, Jauch A, Lubinski J, Cremer, Cancer Res. Feb. 1, 1997; 57(3): 481-487).

[0144] The same region is involved in specific cases of prostate cancer. Matsuyama et al. showed the specific deletion of the 8p23 band in prostate cancer cases, as monitored by FISH with the D8S7 probe. (Matsuyama H, Pan Y, Skoog L, Tribukait B, Naito K, Ekman P, Lichter P, Bergerheim U S, Oncogene October 1994; 9(10): 3071-3076). They were able to document a substantial number of cases with deletions of 8p23 but retention of the 8p22 marker LPL. Moreover, Ichikawa et al. deduced the existence of a prostate cancer metastasis suppressor gene and localized it to 8p23-q12 by studies of metastasis suppression in highly metastatic rat prostate cells after transfer of human chromosomes. (Ichikawa T, Nihei N, Kuramochi H, Kawana Y, Killary A M, Rinker-Schaeffer C W, Barrett J C, Isaacs J T, Kugoh H, Oshimura M, Shimazaki J, Prostate Suppl. 1996; 6: 31-35).

[0145] Recently Washburn et al. were able to find substantial numbers of tumors with allelic loss specific to 8p23 by LOH studies of 31 cases of human prostate cancer. (Washburn J, Woino K, and Macoska J, Proceedings of American Association for Cancer Research, March 1997; 38). In these samples they were able to define the minimal overlapping region with deletions covering the genetic interval D8S262-D8S277.

Linkage Analysis Studies: Search for Prostate Cancer Linked Regions on Chromosome 8

[0146] Microsatellite markers mapping to chromosome 8 were used by the inventors to perform linkage analysis studies on 194 individuals from 47 families affected with prostate cancer. While multiple point analysis led to weak linkage results, two point lod score analysis led to non-significant results, as shown below.

Two point lod (parametric analysis)

Marker      Distance (cM)   Z(lod) scores
D8S1742                     −0.13
D8S561      0.8             −0.07

# of families analyzed: 47
Total # of individuals genotyped: 194
Total # of affected individuals genotyped: 122
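For orientation, the quantity reported as a two point lod score can be sketched in its simplest textbook form, in which the phase of every meiosis is known and recombinants can be counted directly. This is an illustrative assumption only; the parametric analysis summarized above relies on a disease model and pedigree likelihoods that are not reproduced here, and the function name is hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """Two point lod score for phase-known meioses (simplified textbook case).

    Z(theta) = log10( theta^r * (1 - theta)^(n - r) / 0.5^n )
    where r is the number of recombinant meioses out of n informative ones.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    r, n = recombinants, meioses
    log_linked = r * math.log10(theta) + (n - r) * math.log10(1.0 - theta)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked


# Few recombinants support linkage; many recombinants give a negative score.
print(round(lod_score(1, 10, 0.1), 3))
print(round(lod_score(5, 10, 0.1), 3))
```

A negative score at a tested recombination fraction, as for the two markers in the table, means the data are less likely under linkage at that fraction than under free recombination.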

[0147] In view of the non-significant results obtained with linkage analysis, a new mapping approach based on linkage disequilibrium of biallelic markers was utilised to identify genes responsible for sporadic cases of prostate cancer.

I.B. Linkage Disequilibrium Using Biallelic Markers To Identify Candidate Loci Responsible For Disease

Linkage Disequilibrium

[0148] Once a chromosomal region has been identified as potentially harboring a candidate gene associated with a sporadic trait, an excellent approach to refine the candidate gene's location within the identified region is to look for statistical associations between the trait and some marker genotype when comparing an affected (trait⁺) and a control (trait⁻) population.

[0149] Association studies have most usually relied on the use of biallelic markers. Biallelic markers are genome-derived polynucleotides that exhibit biallelic polymorphism at one single base position. By definition, the lowest allele frequency of a biallelic polymorphism is 1%; sequence variants that show allele frequencies below 1% are called rare mutations. There are potentially more than 10⁷ biallelic markers lying along the human genome.

[0150] Association studies seek to establish correlations between traits and genetic markers and are based on the phenomenon of linkage disequilibrium (LD). LD is defined as the trend for alleles at nearby loci on haploid genomes to correlate in the population. If two genetic loci lie on the same chromosome, then sets of alleles on the same chromosomal segment (i.e., haplotypes) tend to be transmitted as a block from generation to generation. When not broken up by recombination, haplotypes can be tracked not only through pedigrees but also through populations. The resulting phenomenon at the population level is that the occurrence of pairs of specific alleles at different loci on the same chromosome is not random, and the deviation from random is called linkage disequilibrium.
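The “deviation from random” can be made concrete with the standard pairwise measures D, D′ and r². The Python sketch below is an illustration using these commonly used definitions, which the text itself does not spell out; the function name and the example frequencies are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele a and allele b
    p_a:  frequency of allele a at the first locus
    p_b:  frequency of allele b at the second locus
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    # D is the excess of the observed haplotype frequency over the product
    # of allele frequencies expected if the loci were independent.
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2


# Alleles that always travel together: complete disequilibrium.
print(ld_measures(0.5, 0.5, 0.5))
# Alleles that combine at random: no disequilibrium.
print(ld_measures(0.25, 0.5, 0.5))
```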

[0151] Since results generated by association studies are essentially based on the quantitative calculation of allele frequencies, they best apply to the analysis of germline mutations. This is mainly due to the fact that allelic frequencies are difficult to quantify within tumor tissue samples because of the usual presence of normal cells within the studied tumor samples. Association studies applied to cancer genetics will therefore be best suited to the identification of tumor suppressor genes.

Trait Localization by Linkage Disequilibrium Mapping

[0152] Any gene responsible or partly responsible for a given trait will be in LD with some flanking markers. To map such a gene, specific alleles of these flanking markers which are associated with the gene or genes responsible for the trait are identified. Although the following discussion of techniques for finding the gene or genes associated with a particular trait using linkage disequilibrium mapping refers to locating a single gene which is responsible for the trait, it will be appreciated that the same techniques may also be used to identify genes which are partially responsible for the trait.

[0153] Association studies are conducted within the general population (as opposed to the linkage analysis techniques discussed above, which are limited to studies performed on related individuals in one or several affected families).

[0154] Association between a biallelic marker A and a trait T may primarily occur as a result of three possible relationships between the biallelic marker and the trait. First, allele a of biallelic marker A may be directly responsible for trait T (e.g., Apo E e4 allele and Alzheimer's disease). However, since the majority of the biallelic markers used in genetic mapping studies are selected randomly, they mainly map outside of genes. Thus, the likelihood of allele a being a functional mutation directly related to trait T is very low.

[0155] An association between a biallelic marker A and a trait T may also occur when the biallelic marker is very closely linked to the trait locus. In other words, an association occurs when allele a is in linkage disequilibrium with the trait-causing allele. When the biallelic marker is in close proximity to a gene responsible for the trait, more extensive genetic mapping will ultimately allow a gene to be discovered near the marker locus which carries mutations in people with trait T (i.e. the gene responsible for the trait or one of the genes responsible for the trait). As will be further exemplified below using a group of biallelic markers which are in close proximity to the gene responsible for the trait, the location of the causal gene can be deduced from the profile of the association curve between the biallelic markers and the trait. The causal gene will be found in the vicinity of the marker showing the highest association with the trait.

[0156] Finally, an association between a biallelic marker and a trait may occur when people with the trait and people without the trait correspond to genetically different subsets of the population who, coincidentally, also differ in the frequency of allele a (population stratification). This phenomenon is avoided by using large heterogeneous samples.

[0157] Association studies are particularly suited to the efficient identification of susceptibility genes that present common polymorphisms, and are involved in multifactorial traits whose frequency is relatively higher than that of diseases with monofactorial inheritance.

Application of Linkage Disequilibrium Mapping to Candidate Gene Identification

[0158] The general strategy of association studies using a set of biallelic markers is to scan two pools of individuals (affected individuals and unaffected controls) characterized by a well defined phenotype in order to measure the allele frequencies for a number of the chosen markers in each of these pools. If a positive association with a trait is identified using an array of biallelic markers having a high enough density, the causal gene will be physically located in the vicinity of the associated markers, since the markers showing positive association to the trait are in linkage disequilibrium with the trait locus. Regions harboring a gene responsible for a particular trait which are identified through association studies using high density sets of biallelic markers will, on average, be 20-40 times shorter in length than those identified by linkage analysis.

[0159] Once a positive association is confirmed as described above, BACs (bacterial artificial chromosomes) obtained from human genomic libraries, constructed as described below, harboring the markers identified in the association analysis are completely sequenced.

[0160] Once a candidate region has been sequenced and analyzed, the functional sequences within the candidate region (exons and promoters, and other potential regulatory regions) are scanned for mutations which are responsible for the trait by comparing the sequences of a selected number of controls and affected individuals using appropriate software. Candidate mutations are further confirmed by screening a larger number of affected individuals and controls using the microsequencing techniques described below.

[0161] Candidate mutations are identified as follows. A pair of oligonucleotide primers is designed in order to amplify the sequences of every predicted functional region. PCR amplification of each predicted functional sequence is carried out on genomic DNA samples from affected patients and unaffected controls. Amplification products from genomic PCR are subjected to automated dideoxy terminator sequencing reactions and electrophoresed on ABI 377 sequencers. Following gel image analysis and DNA sequence extraction, the sequence data are automatically analyzed to detect the presence of sequence variations among affected cases and unaffected controls. Sequences are systematically verified by comparing the sequences of both DNA strands of each individual.

[0162] Polymorphisms are then verified by screening a larger population of affected individuals and controls using the microsequencing technique described below in an individual test format. Polymorphisms are considered as candidate mutations when present in affected individuals and controls at frequencies compatible with the expected association results.

Association Studies: Statistical Analysis and Haplotyping

[0163] As mentioned above, linkage analysis typically localizes a disease gene to a chromosomal region of several megabases. Further refinement in location requires the analysis of additional families in order to increase the number of recombinants. However, this approach becomes unfeasible because recombination is rarely observed even within large pedigrees (Boehnke, M, 1994, Am. J. Hum. Genet. 55: 379-390).

[0164] Linkage disequilibrium, the nonrandom association of alleles at linked loci, may offer an alternative method of obtaining additional recombinants. When a chromosome carrying a mutant allele of a gene responsible for a given trait is first introduced into a population as a result of either mutation or migration, the mutant allele necessarily resides on a chromosome having a unique set of linked markers (haplotype). Consequently, there is complete disequilibrium between these markers and the disease mutation: the disease mutation is present only linked to a specific set of marker alleles. Through subsequent generations, recombinations occur between the disease mutation and these marker polymorphisms, resulting in a gradual disappearance of disequilibrium. The degree of disequilibrium dissipation depends on the recombination frequency, so the markers closest to the disease gene will tend to show higher levels of disequilibrium than those that are farther away (Jorde L B, 1995, Am. J. Hum. Genet. 56: 11-14). Because linkage disequilibrium patterns in a present-day population reflect the action of recombination through many past generations, disequilibrium analysis effectively increases the sample of recombinants. Thus the mapping resolution achieved through the analysis of linkage disequilibrium patterns is much higher than that of linkage analysis.
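The gradual disappearance of disequilibrium can be illustrated with the standard population-genetic decay model, in which the disequilibrium coefficient shrinks by a factor (1 − θ) per generation for a recombination fraction θ. The sketch below assumes random mating and no selection, drift or new mutation; these assumptions, like the function names, are illustrative and not taken from the text.

```python
import math


def ld_after(d0, theta, generations):
    """Disequilibrium remaining after a number of generations.

    Standard decay model: D_t = D_0 * (1 - theta)^t.
    """
    return d0 * (1.0 - theta) ** generations


def generations_to_halve(theta):
    """Generations needed for disequilibrium to fall to half its value."""
    return math.log(0.5) / math.log(1.0 - theta)


# A tightly linked marker (theta = 0.001) keeps its disequilibrium far
# longer than a loosely linked one (theta = 0.1).
for theta in (0.001, 0.01, 0.1):
    print(theta, round(generations_to_halve(theta), 1))
```

Under this model, markers close to the trait locus retain disequilibrium for many more generations, which is why the strongest association is expected near the causal gene.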

[0165] In practice, in order to define the regions bearing a candidate gene, the affected and control populations are genotyped using an appropriate number of biallelic markers (at a density of 1 marker every 50-150 kilobases). Then, a marker/trait association study is performed that compares the genotype frequency of each biallelic marker in the affected and control populations by means of a chi square statistical test (one degree of freedom).
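A chi square test with one degree of freedom on a 2 × 2 table can be sketched as follows. The example assumes that the table holds allele counts in cases and controls, which is one common way to obtain a single degree of freedom; the counts are invented for illustration and the function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi square (no continuity correction) for the 2 x 2 table

                 allele 1   allele 2
        cases        a          b
        controls     c          d

    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For one degree of freedom the upper tail probability of the chi square
    # distribution equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Invented counts: allele 1 seen on 60 of 100 case chromosomes and on
# 40 of 100 control chromosomes.
print(chi_square_2x2(60, 40, 40, 60))
```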

[0166] After the first screening, additional markers within the region showing positive association are genotyped in the affected and control populations. Two types of complementary analysis are then performed. First, a marker/trait association study (as described above) is performed to refine the location of the gene responsible for the trait. In addition, a haplotype association analysis is performed to define the frequency and the type of the ancestral/preferential carrier haplotype. Haplotype analysis, by combining the informativeness of a set of biallelic markers, increases the power of the association analysis, allowing false positive and/or negative data that may result from the single marker studies to be eliminated.

[0167] The haplotype analysis is performed by estimating the frequencies of all possible haplotypes for a given set of biallelic markers in the case and control populations, and comparing these frequencies by means of a chi square statistical test (one degree of freedom). Haplotype estimations are performed by applying the Expectation-Maximization (EM) algorithm (Excoffier L & Slatkin M, 1995, Mol. Biol. Evol. 12: 921-927), using the EM-HAPLO program (Hawley M E, Pakstis A J & Kidd K K, 1994, Am. J. Phys. Anthropol. 18: 104). The EM algorithm is used to estimate haplotype frequencies in the case when only genotype data from unrelated individuals are available. The EM algorithm is a generalized iterative maximum likelihood approach to estimation that is useful when data are ambiguous and/or incomplete.
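The idea behind EM haplotype estimation can be sketched for the smallest case, two biallelic markers, where the only phase-ambiguous genotype is the double heterozygote. The code below is a simplified illustration written for this document, not the EM-HAPLO program; the genotype encoding, function name and example data are assumptions.

```python
from collections import Counter

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def em_haplotype_frequencies(genotypes, iterations=100):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    Each genotype is a pair (g1, g2) giving the number of copies (0, 1 or 2)
    of allele "1" at locus 1 and locus 2. Assumes Hardy-Weinberg proportions.
    """
    counts = Counter(genotypes)
    n_chromosomes = 2 * len(genotypes)
    freqs = {h: 0.25 for h in HAPLOTYPES}
    for _ in range(iterations):
        expected = {h: 0.0 for h in HAPLOTYPES}
        for (g1, g2), k in counts.items():
            if (g1, g2) == (1, 1):
                # E-step: a double heterozygote is either 00/11 or 01/10;
                # weight the two phases by the current frequency estimates.
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                expected[(0, 0)] += k * w
                expected[(1, 1)] += k * w
                expected[(0, 1)] += k * (1.0 - w)
                expected[(1, 0)] += k * (1.0 - w)
            else:
                # Phase is unambiguous when at most one locus is heterozygous.
                expected[(g1 // 2, g2 // 2)] += k
                expected[((g1 + 1) // 2, (g2 + 1) // 2)] += k
        # M-step: new frequencies are the expected haplotype proportions.
        freqs = {h: expected[h] / n_chromosomes for h in HAPLOTYPES}
    return freqs


sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_frequencies(sample))
```

In this invented sample the unambiguous homozygotes show only haplotypes 00 and 11, so the iterations assign the double heterozygotes almost entirely to the 00/11 phase.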

[0168] The application of biallelic marker based linkage disequilibrium analysis to the 8p23 region to identify a gene associated with prostate cancer is described below.

I.C. Application of Linkage Disequilibrium Mapping to the 8p23 Region

YAC Contig Construction in 8p23 Region

[0169] First, a YAC contig which contains the 8p23 region was constructed as follows. The CEPH-Genethon YAC map for the entire human genome (Chumakov I. M. et al. A YAC contig map of the human genome, Nature, 377 Supp.: 175-297, 1995) was used for detailed contig building in the region around the D8S262 and D8S277 genetic markers. Screening data available for regional genetic markers D8S1706, D8S277, D8S1742, D8S518, D8S262, D8S1798, D8S1140, D8S561 and D8S1819 were used to select the following set of CEPH YACs, localized within this region: 832_g_12, 787_c_11, 920_h_7, 807_a_1, 842_b_1, 745_a_3, 910_d_3, 879_f_11, 918_c_6, 764_c_7, 910_f_12, 967_c_11, 856_d_8, 792_a_6, 812_h_4, 873_c_8, 930_a_2, 807_a_1, 852_d_10. This set of YACs was tested by PCR with the above mentioned genetic markers as well as with other publicly available markers supposedly located within the 8p23 region. As a result of these studies, a YAC STS contig map was generated around genetic markers D8S262 and D8S277. The two CEPH YACs, 920_h_7 (1170 kb insert size) and 910_f_12 (1480 kb insert size), constitute a minimal tiling path in this region, with an estimated size of ca. 2 megabases.

[0170] During this mapping effort, the following publicly known STS markers were precisely located within the contig: WI-14718, WI-3831, D8S1413E, WI-8327, WI-3823, ND4.

BAC Contig Construction Covering D8S262-D8S277 Fragment Within 8p23 Region of the Human Genome

[0171] Following construction of the YAC contig, a BAC contig was constructed as follows. BAC libraries were obtained as described in Woo et al. Nucleic Acids Res., 1994, 22, 4922-4931. Briefly, two different whole human genome libraries were produced by cloning BamHI or HindIII partially digested DNA from a lymphoblastoid cell line (derived from individual No. 8445, CEPH families) into the pBeloBAC11 vector (Kim et al. Genomics, 1996, 34, 213-218). The library produced with the BamHI partial digestion contained 110,000 clones with an average insert size of 150 kb, which corresponds to 5 human haploid genome equivalents. The library prepared with the HindIII partial digestion corresponds to 3 human genome equivalents with an average insert size of 150 kb.
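The relationship between clone number, insert size and genome equivalents is simple arithmetic, sketched below. The haploid genome size of 3.0 × 10⁹ bp is a round figure assumed here, not a value stated in the text, and the function names are hypothetical. The second function applies the usual library representation estimate, P = 1 − (1 − f)^N, where f is the fraction of the genome carried by one clone.

```python
def genome_equivalents(n_clones, insert_bp, genome_bp=3.0e9):
    """Library coverage expressed in haploid genome equivalents."""
    return n_clones * insert_bp / genome_bp


def probability_represented(n_clones, insert_bp, genome_bp=3.0e9):
    """Probability that a given single-copy sequence is in the library,

    assuming clones are sampled uniformly and independently.
    """
    fraction = insert_bp / genome_bp
    return 1.0 - (1.0 - fraction) ** n_clones


# BamHI library figures from the text: 110,000 clones, 150 kb average insert.
print(genome_equivalents(110_000, 150_000))
print(round(probability_represented(110_000, 150_000), 4))
```

With the assumed genome size the BamHI library works out to about 5.5 equivalents, in line with the approximately 5 stated above; a slightly larger genome size assumption brings the figure closer to 5.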

BAC Screening

[0172] The human genomic BAC libraries obtained as described above were screened with all of the above mentioned STSs. DNA from the clones in both libraries was isolated and pooled in a three dimensional format ready for PCR screening with the above mentioned STSs using high throughput PCR methods (Chumakov et al., Nature 1995, 377: 175-298). Briefly, three dimensional pooling consists of rearranging the samples to be tested in a manner which allows the number of PCR reactions required to screen the clones with STSs to be reduced by at least 100 fold, as compared to screening each clone individually. PCR amplification products were detected by conventional agarose gel electrophoresis combined with automated image capturing and processing.
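The principle of three dimensional pooling can be sketched as follows: each clone is assigned a position in a grid, one pool is made per plane along each axis, and a clone is identified by the intersection of the positive pool on each axis. The grid dimensions and function names are hypothetical, and the actual pooling scheme used is not described in enough detail to reproduce.

```python
def pool_coordinates(index, dims):
    """Map a clone index to its (x, y, z) position in the pooling grid."""
    nx, ny, nz = dims
    if not 0 <= index < nx * ny * nz:
        raise ValueError("clone index outside the grid")
    x, rest = divmod(index, ny * nz)
    y, z = divmod(rest, nz)
    return x, y, z


def locate_clone(x, y, z, dims):
    """Recover the clone index from one positive pool on each axis."""
    _, ny, nz = dims
    return (x * ny + y) * nz + z


def pools_required(dims):
    """One pool per plane on each axis, so the pool count is nx + ny + nz."""
    return sum(dims)


# Hypothetical 48 x 48 x 48 grid, large enough for about 110,000 clones.
dims = (48, 48, 48)
coords = pool_coordinates(12345, dims)
print(coords, locate_clone(*coords, dims))
print(48 ** 3, "clones screened with", pools_required(dims), "reactions per STS")
```

The intersection is unique only when a single clone is positive; with several positive clones the coordinates can combine ambiguously, which is one reason for confirming candidate clones individually afterwards.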

[0173] In a final step, STS-positive clones were checked individually. Subchromosomal localization of BACs was systematically verified by fluorescence in situ hybridization (FISH), performed on metaphasic chromosomes as described by Cherif et al. Proc. Natl. Acad. Sci. USA 1990, 87: 6639-6643.

[0174] BAC insert size was determined by Pulsed Field Gel Electrophoresis after digestion with restriction enzyme NotI.

BAC Contig Analysis

[0175] The ordered BACs selected by STS screening and verified by FISH were assembled into contigs, and new markers were generated by partial sequencing of insert ends from some of them. These markers were used to fill the gaps in the contig of BAC clones covering the chromosomal region around D8S277, having an estimated size of 2 megabases. Selected BAC clones from the contig were subcloned and sequenced.

BAC Subcloning

[0176] The human DNA of each BAC was first extracted using the alkaline lysis procedure and then sheared by sonication. The obtained DNA fragments were end-repaired and electrophoresed on preparative agarose gels. The fragments in the desired size range were isolated from the gel, purified and ligated to a linearized, dephosphorylated, blunt-ended plasmid cloning vector (pBluescript II SK (+)). Example 1 describes the BAC subcloning procedure.

EXAMPLE 1

[0177] The cells obtained from a three-liter overnight culture of each BAC clone were treated by alkaline lysis using conventional techniques to obtain the BAC DNA containing the genomic DNA inserts. After centrifugation of the BAC DNA in a cesium chloride gradient, ca. 50 μg of BAC DNA was purified. 5-10 μg of BAC DNA was sonicated using three distinct conditions, to obtain fragments of the desired size. The fragments were treated in a 50 μl volume with two units of Vent polymerase for 20 min at 70° C., in the presence of the four deoxytriphosphates (100 μM). The resulting blunt-ended fragments were separated by electrophoresis on low-melting point 1% agarose gels (60 Volts for 3 hours). The fragments were excised from the gel and treated with agarase. After chloroform extraction and dialysis on Microcon 100 columns, DNA in solution was adjusted to a 100 ng/μl concentration. A ligation was performed overnight by adding 100 ng of BAC fragmented DNA to 20 ng of pBluescript II SK (+) vector DNA linearized by enzymatic digestion and treated with alkaline phosphatase. The ligation reaction was performed in a 10 μl final volume in the presence of 40 units/μl T4 DNA ligase (Epicentre). The ligated products were electroporated into the appropriate cells (ElectroMAX E. coli DH10B cells). IPTG and X-gal were added to the cell mixture, which was then spread on the surface of an ampicillin-containing agar plate. After overnight incubation at 37° C., recombinant (white) colonies were randomly picked and arrayed in 96 well microplates for storage and sequencing.

Partial Sequencing of BACs

[0178] At least 30 of the obtained BAC clones were sequenced by the end pair-wise method (500 bp sequence from each end) using a dye-primer cycle sequencing procedure. Pair-wise sequencing was performed until a map allowing the relative positioning of selected markers along the corresponding DNA region was established. Example 2 describes the sequencing and ordering of the BAC inserts.

EXAMPLE 2

[0179] The subcloned inserts were amplified by PCR on overnight bacterial cultures, using vector primers flanking the insertions. The insert extremity sequences (on average 500 bases at each end) were determined by fluorescent automated sequencing on ABI 377 sequencers, with the ABI Prism DNA Sequencing Analysis software (version 2.1.2).

[0180] The sequence fragments from the BAC subclones were assembled using Gap4 software from R. Staden (Bonfield et al. 1995). This software allows the reconstruction of a single sequence from sequence fragments. The sequence deduced from the alignment of different fragments is called the consensus sequence. We used directed sequencing techniques (primer walking) to complete sequences and link contigs.

[0181] FIG. 1 shows the overlapping BAC subclones (labeled BAC) which make up the assembled contig and the positions of the publicly known STS markers along the contig.

Identification of Biallelic Markers Lying Along the BAC Contig

[0182] Following assembly of the BAC contig, biallelic markers lying along the contig were then identified. Given that the assessed distribution of informative biallelic markers in the human genome (biallelic polymorphisms with a heterozygosity rate higher than 42%) is one in 2.5 to 3 kb, six 500 bp genomic fragments have to be screened in order to identify 1 biallelic marker. Six pairs of primers per potential marker, each one defining a ca. 500 bp amplification fragment, were derived from the above mentioned BAC partial sequences. All primers contained a common upstream oligonucleotide tail enabling the easy systematic sequencing of the resulting amplification fragments. Amplification of each BAC-derived sequence was carried out on pools of DNA from ca. 100 individuals. The conditions used for the polymerase chain reaction were optimized so as to obtain more than 95% of PCR products giving 500 bp sequence reads.
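
The screening arithmetic in the paragraph above can be checked with a short sketch. The function name and its defaults are illustrative assumptions, not part of the described procedure; only the marker spacing (2.5 to 3 kb) and the 500 bp fragment size come from the text.

```python
import math

def fragments_per_marker(marker_spacing_bp: float, fragment_bp: int = 500) -> int:
    """Fragments to screen so that, on average, one informative marker is covered.

    Hypothetical helper: divides the expected spacing between informative
    markers by the fragment length and rounds up.
    """
    return math.ceil(marker_spacing_bp / fragment_bp)

# Spacing of one informative marker per 2.5 kb to 3 kb, as stated in the text.
for spacing in (2500, 3000):
    print(spacing, "bp spacing ->", fragments_per_marker(spacing), "fragments")
```

At the upper end of the stated range (3 kb) this gives the six fragments per marker used in the text.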

[0183] The amplification products from genomic PCR using the oligonucleotides derived from the BAC subclones were subjected to automated dideoxy terminator sequencing reactions using a dye-primer cycle sequencing protocol. Following gel image analysis and DNA sequence extraction, sequence data were automatically processed with appropriate software to assess sequence quality and to detect the presence of biallelic sites among the pooled amplified fragments. Biallelic sites were systematically verified by comparing the sequences of both strands of each pool.

[0184] The detection limit for the frequency of biallelic polymorphisms detected by sequencing pools of 100 individuals is 0.3+/−0.05 for the minor allele, as verified by sequencing pools of known allelic frequencies. Thus, the biallelic markers selected by this method will be “informative biallelic markers” since they have a frequency of 0.3 to 0.5 for the minor allele and 0.5 to 0.7 for the major allele, therefore an average heterozygosity rate higher than 42%.

[0185] Example 3 describes the preparation of genomic DNA samples from the individuals screened to identify biallelic markers.
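
The 42% figure follows from the allele frequencies if genotypes are assumed to be in Hardy-Weinberg proportions, where the expected heterozygosity is 2p(1 − p). That assumption, and the function name below, are illustrative and are not stated in the text.

```python
def heterozygosity(minor_allele_freq: float) -> float:
    """Expected heterozygosity 2p(1 - p) for a biallelic marker.

    Assumes Hardy-Weinberg proportions (an assumption of this sketch).
    """
    p = minor_allele_freq
    return 2.0 * p * (1.0 - p)

# Minor allele frequencies spanning the range given in the text (0.3 to 0.5).
for p in (0.3, 0.4, 0.5):
    print(f"p = {p:.1f} -> H = {heterozygosity(p):.2f}")
```

Under this assumption the lower bound p = 0.3 gives H = 0.42 and the maximum, at p = 0.5, is 0.50.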

EXAMPLE 3

[0186] The population used in order to generate biallelic markers in the region of interest consisted of ca. 100 unrelated individuals corresponding to a French heterogeneous population.

[0187] DNA was extracted from peripheral venous blood of each donor as follows.

[0188] 30 ml of blood were taken in the presence of EDTA. Cells (pellet) were collected after centrifugation for 10 minutes at 2000 rpm. Red cells were lysed by a lysis solution (50 ml final volume: 10 mM Tris pH 7.6; 5 mM MgCl₂; 10 mM NaCl). The solution was centrifuged (10 minutes, 2000 rpm) as many times as necessary to eliminate the residual red cells present in the supernatant, after resuspension of the pellet in the lysis solution.

[0189] The pellet of white cells was lysed overnight at 42° C. with 3.7 ml of lysis solution composed of:

[0190] 3 ml TE 10-2 (Tris-HCl 10 mM, EDTA 2 mM) NaCl 0.4 M

[0191] 200 μl SDS 10%

[0192] 500 μl K-proteinase (2 mg K-proteinase in TE 10-2/ NaCl 0.4 M).

[0193] For the extraction of proteins, 1 ml saturated NaCl (6 M) (1/3.5 v/v) was added. After vigorous agitation, the solution was centrifuged for 20 minutes at 10000 rpm.

[0194] For the precipitation of DNA, 2 to 3 volumes of 100% ethanol were added to the previous supernatant, and the solution was centrifuged for 30 minutes at 2000 rpm. The DNA solution was rinsed three times with 70% ethanol to eliminate salts, and centrifuged for 20 minutes at 2000 rpm. The pellet was dried at 37° C., and resuspended in 1 ml TE 10-1 or 1 ml water. The DNA concentration was evaluated by measuring the OD at 260 nm (1 unit OD=50 μg/ml DNA).

[0195] To determine the presence of proteins in the DNA solution, the OD 260/OD 280 ratio was determined. Only DNA preparations having an OD 260/OD 280 ratio between 1.8 and 2 were used in the subsequent steps described below.
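
The concentration and purity rules above can be expressed as a short sketch. The conversion factor (1 OD unit = 50 μg/ml) and the 1.8 to 2 acceptance window come from the text; the function names and the `dilution_factor` parameter are hypothetical additions for illustration.

```python
def dna_concentration_ug_per_ml(od260: float, dilution_factor: float = 1.0) -> float:
    """Convert an OD 260 reading to μg/ml of DNA (1 OD unit = 50 μg/ml).

    dilution_factor is a hypothetical convenience for diluted readings.
    """
    return od260 * 50.0 * dilution_factor

def passes_purity(od260: float, od280: float,
                  low: float = 1.8, high: float = 2.0) -> bool:
    """True when the OD 260/OD 280 ratio lies within the accepted window."""
    return low <= od260 / od280 <= high

# Example readings (invented values, for illustration only).
print(dna_concentration_ug_per_ml(0.5, dilution_factor=10))  # 250.0
print(passes_purity(0.95, 0.5))   # ratio 1.9, accepted
print(passes_purity(0.75, 0.5))   # ratio 1.5, rejected
```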

DNA Amplification

[0196] Once each BAC was isolated, pairs of primers, each one defining a 500 bp amplification fragment, were designed. Each of the primers contained a common oligonucleotide tail upstream of the specific bases targeted for amplification, allowing the amplification products from each set of primers to be sequenced using the common sequence as a sequencing primer. The primers used for the genomic amplification of sequences derived from BACs were defined with the OSP software (Hillier L. and Green P. Methods Appl., 1991, 1: 124-8). The synthesis of primers was performed following the phosphoramidite method, on a GENSET UFPS 24.1 synthesizer.

[0197] Example 4 provides the procedures used in the amplification reactions.

EXAMPLE 4

[0198] The amplification of each sequence was performed by PCR (Polymerase Chain Reaction) as follows:

final volume                                              50 μl
genomic DNA                                               100 ng
MgCl₂                                                     2 mM
dNTP (each)                                               200 μM
primer (each)                                             7.5 pmoles
AmpliTaq Gold DNA polymerase (Perkin)                     1 unit
PCR buffer (10 X = 0.1 M Tris HCl pH 8.3, 0.5 M KCl)      1 X

[0199] The amplification was performed on a Perkin Elmer 9600 Thermocycler or MJ Research PTC200 with heating lid. After heating at 94° C. for 10 minutes, 35 cycles were performed. Each cycle comprised: 30 sec at 94° C., 1 minute at 55° C., and 30 sec at 72° C. A final elongation of 7 minutes at 72° C. ended the amplification.
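
As a worked check of the cycling program above, the following sketch adds up the programmed hold times. The constant and function names are illustrative; temperature ramping between steps is instrument-dependent and is not counted.

```python
# Hold times taken from the cycling program described in the text, in seconds.
INITIAL_DENATURATION_S = 10 * 60      # 10 minutes at 94° C.
CYCLE_STEPS_S = (30, 60, 30)          # 94° C., 55° C., 72° C.
N_CYCLES = 35
FINAL_ELONGATION_S = 7 * 60           # 7 minutes at 72° C.

def program_hold_time_s() -> int:
    """Total programmed hold time, excluding ramping between temperatures."""
    return INITIAL_DENATURATION_S + N_CYCLES * sum(CYCLE_STEPS_S) + FINAL_ELONGATION_S

total = program_hold_time_s()
print(total, "s =", total / 60, "min")  # 5220 s = 87.0 min
```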

[0200] The obtained quantity of amplification products was determined on 96-well microtiter plates, using a fluorimeter and Picogreen as intercalating agent (Molecular Probes).

[0201] The sequences of the amplification products were determined for each of the approximately 100 individuals from whom genomic DNA was obtained. Those amplification products which contained biallelic markers were identified.

[0202] FIG. 1 shows the locations of the biallelic markers along the 8p23 BAC contig. This first set of markers corresponds to a medium density map of the candidate locus, with an inter-marker distance averaging 50 kb-150 kb.

[0203] A second set of biallelic markers was then generated as described above in order to provide a very high-density map of the region identified using the first set of markers which can be used to conduct association studies, as explained below. The high density map has markers spaced on average every 2-50 kb.

[0204] The biallelic markers were then used in association studies as described below.

Collection of DNA Samples from Affected and Non-Affected Individuals

[0205] Prostate cancer patients were recruited according to clinical inclusion criteria based on pathological or radical prostatectomy records. Control cases included in this study were both ethnically- and age-matched to the affected cases; they were checked for both the absence of all clinical and biological criteria defining the presence or the risk of prostate cancer, and for the absence of related familial prostate cancer cases. Both affected and control individuals corresponded to unrelated cases.

[0206] The two following pools of independent individuals were used in the association studies. The first pool, comprising individuals suffering from prostate cancer, contained 185 individuals. Of these 185 cases of prostate cancer, 45 cases were sporadic and 140 cases were familial. The second pool, the control pool, contained 104 non-diseased individuals.

[0207] Haplotype analysis was conducted using additional diseased (total samples: 281) and control samples (total samples: 130), from individuals recruited according to similar criteria.

Genotyping Affected and Control Individuals

[0208] The general strategy to perform the association studies was to individually scan the DNA samples from all individuals in each of the two populations described above in order to establish the allele frequencies of the above described biallelic markers in each of these populations.

[0209] Allelic frequencies of the above-described biallelic markers in each population were determined by performing microsequencing reactions on amplified fragments obtained by genomic PCR performed on the DNA samples from each individual.

[0210] DNA samples and amplification products from genomic PCR were obtained in similar conditions as those described above for the generation of biallelic markers, and subjected to automated microsequencing reactions using fluorescent ddNTPs (specific fluorescence for each ddNTP) and the appropriate oligonucleotide microsequencing primers which hybridized just upstream of the polymorphic base. Once specifically extended at the 3′ end by a DNA polymerase using the complementary fluorescent dideoxynucleotide analog (thermal cycling), the primer was precipitated to remove the unincorporated fluorescent ddNTPs. The reaction products were analyzed by electrophoresis on ABI 377 sequencing machines.
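
Once each individual has been genotyped, allele frequencies in a population follow by simple counting. The sketch below is a minimal illustration under the assumption that each genotype is recorded as a two-character string (one character per allele); that encoding and the function name are hypothetical.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Allele frequencies from per-individual diploid genotypes.

    genotypes: iterable of two-character strings such as "AG"
    (a hypothetical encoding chosen for this sketch).
    """
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)          # counts each of the two alleles
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Invented genotypes for four individuals at one biallelic marker.
print(allele_frequencies(["AA", "AG", "GG", "AG"]))  # {'A': 0.5, 'G': 0.5}
```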

[0211] Example 5 describes one microsequencing procedure.

EXAMPLE 5

[0212] 5 μl of PCR products in a microtiter plate were added to 5 μl purification mix {2 U SAP (Amersham); 2 U Exonuclease I (Amersham); 1 μl SAP 10X buffer: 400 mM Tris-HCl pH 8, 100 mM MgCl2; H2O final volume 5 μl}. The reaction mixture was incubated 30 minutes at 37° C., and denatured 10 minutes at 94° C. After 10 sec centrifugation, the microsequencing reaction was performed on line with the whole purified reaction mixture (10 μl) in the microplate using 10 pmol microsequencing oligonucleotide (23mers, GENSET, crude synthesis, 5 OD), 0.5 U Thermosequenase (Amersham), 1.25 μl Thermosequenase 16× buffer (Amersham), both of the fluorescent ddNTPs (Perkin Elmer) corresponding to the polymorphism {0.025 μl ddTTP and ddCTP, 0.05 μl ddATP and ddGTP}, H2O to a final volume of 20 μl. A PCR program on a GeneAmp 9600 thermocycler was carried out as follows: 4 minutes at 94° C.; 5 sec at 55° C./10 sec at 94° C. for 20 cycles. The reaction was incubated at 4° C. until precipitation. The microtiter plate was centrifuged 10 sec at 1500 rpm. 19 μl MgCl2 2 mM and 55 μl 100% ethanol were added in each well. After 15 minute incubation at room temperature, the microtiter plate was centrifuged at 3300 rpm 15 minutes at 4° C. Supernatants were discarded by inverting the microtiter plate on a box folded to proper size and by centrifugation at 300 rpm 2 minutes at 4° C. afterwards. The microplate was then dried 5 minutes in a vacuum drier. The pellets were resuspended in 2.5 μl formamide EDTA loading buffer (0.7 μl of 9 μg/μl dextran blue in 25 mM EDTA and 1.8 μl formamide). A 10% polyacrylamide gel/12 cm/64 wells was pre-run for 5 minutes on an ABI 377 sequencer. After 5 minutes denaturation at 100° C., 0.8 μl of each microsequencing reaction product was loaded in each well of the gel. After migration (2 h 30 for 2 microtiter plates of PCR products per gel), the fluorescent signals emitted by the incorporated ddNTPs were analyzed on the ABI 377 sequencer using the GENESCAN software (Perkin Elmer). Following gel analysis, data were automatically processed with software that allowed the determination of the alleles of biallelic markers present in each amplified fragment.

I.D. Initial Association Studies

[0213] Association studies were run in two successive steps. In a first step, a rough localization of the candidate gene was achieved by determining the frequencies of the biallelic markers of FIG. 1 in the affected and unaffected populations. The results of this rough localization are shown in FIG. 2. This analysis indicated that a gene responsible for prostate cancer was located near the biallelic marker designated 4-67.

[0214] In a second phase of the analysis, the position of the gene responsible for prostate cancer was further refined using the very high density set of markers described above. The results of this localization are shown in FIG. 3.

[0215] As shown in FIG. 3, the second phase of the analysis confirmed that the gene responsible for prostate cancer was near the biallelic marker designated 4-67, most probably within a ca. 150 kb region comprising the marker.

Haplotype Analysis

[0216] The allelic frequencies of each of the alleles of biallelic markers 99-123, 4-26, 4-14, 4-77, 99-217, 4-67, 99-213, 99-221, and 99-135 (SEQ ID NOs: 21-38) were determined in the affected and unaffected populations. Table 1 lists the internal identification numbers of the markers used in the haplotype analysis (SEQ ID NOs: 21-38), the alleles of each marker, the most frequent allele in both unaffected individuals and individuals suffering from prostate cancer, the least frequent allele in both unaffected individuals and individuals suffering from prostate cancer, and the frequencies of these alleles in each population.

[0217] Among all the theoretical potential different haplotypes based on 2 to 9 markers, 11 haplotypes showing a strong association with prostate cancer were selected. The results of these haplotype analyses are shown in FIG. 4.

[0218] FIGS. 2, 3, and 4 aggregate linkage analysis results with sequencing results which permitted the physical order and/or the distance between markers to be estimated.

[0219] The significance of the values obtained in FIG. 4 is underscored by the following results of computer simulations. For the computer simulations, the data from the affected individuals and the unaffected controls were pooled and randomly allocated to two groups which contained the same number of individuals as the affected and unaffected groups used to compile the data summarized in FIG. 4. A haplotype analysis was run on these artificial groups for the six markers included in haplotype 5 of FIG. 4. This experiment was reiterated 100 times and the results are shown in FIG. 5. Among 100 iterations, only 5% of the obtained haplotypes are present with a p-value below 1×10⁻⁴ as compared to the p-value of 9×10⁻⁷ for haplotype 5 of FIG. 4. Furthermore, for haplotype 5 of FIG. 4, only 6% of the obtained haplotypes have a significance level below 5×10⁻³, while none of them show a significance level below 5×10⁻⁵.
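
The simulation described above is a label-permutation test: individuals are pooled, reassigned at random to groups of the original sizes, and the analysis is repeated. The sketch below illustrates the idea under simplifying assumptions. It replaces the multi-marker haplotype analysis with a single statistic (the absolute difference in haplotype carrier frequency between groups), and all names and data are hypothetical.

```python
import random

def frequency_gap(labels, carriers):
    """Absolute difference in carrier frequency between affected and unaffected.

    labels[i] is True for an affected individual; carriers[i] is True when
    that individual carries the haplotype of interest.
    """
    affected = [c for is_case, c in zip(labels, carriers) if is_case]
    unaffected = [c for is_case, c in zip(labels, carriers) if not is_case]
    return abs(sum(affected) / len(affected) - sum(unaffected) / len(unaffected))

def permutation_p_value(labels, carriers, n_iter=100, seed=0):
    """Fraction of random reallocations at least as extreme as the observed data.

    Shuffling the labels keeps both group sizes fixed, which mirrors the
    random allocation to two groups of the original sizes.
    """
    rng = random.Random(seed)
    observed = frequency_gap(labels, carriers)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        if frequency_gap(shuffled, carriers) >= observed:
            hits += 1
    return hits / n_iter

# Invented data: 20 affected carriers and 20 unaffected non-carriers.
labels = [True] * 20 + [False] * 20
carriers = [True] * 20 + [False] * 20
print(permutation_p_value(labels, carriers, n_iter=100, seed=1))
```

With only 100 iterations, as in the text, the smallest non-zero empirical p-value is 0.01, so such a simulation bounds rather than reproduces a very small analytical p-value.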

[0220] Thus, using the data of FIG. 4 and evaluating the associations for single marker alleles or for haplotypes will permit estimation of the risk a corresponding carrier has to develop prostate cancer. Significance thresholds of relative risks will be adapted to the reference sample population used.
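
One common way to express such a risk in a case-control design is the odds ratio, which approximates the relative risk when the disease is uncommon in the reference population. The text does not specify which risk estimator is used, so the sketch below is an assumption, and the counts are invented for illustration.

```python
def odds_ratio(case_carriers: int, case_noncarriers: int,
               control_carriers: int, control_noncarriers: int) -> float:
    """Odds ratio for carrying an allele or haplotype, from a 2x2 table of counts."""
    denominator = case_noncarriers * control_carriers
    if denominator == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (case_carriers * control_noncarriers) / denominator

# Invented counts: 30 of 100 cases and 10 of 100 controls carry the haplotype.
print(round(odds_ratio(30, 70, 10, 90), 2))  # 3.86
```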

[0221] The diagnostic techniques may employ a variety of methodologies to determine whether a test subject has a biallelic marker pattern associated with an increased risk of developing prostate cancer or suffers from prostate cancer resulting from a mutant PG1 allele. These include any method enabling the analysis of individual chromosomes for haplotyping, such as family studies, single sperm DNA analysis or somatic hybrids.

[0222] In each of these methods, a nucleic acid sample is obtained from the test subject and the biallelic marker pattern for one or more of the biallelic markers listed in FIGS. 4, 6A and 6B is determined. The biallelic markers listed in FIG. 6A are those which were used in the haplotype analysis of FIG. 4. The first column of FIG. 6A lists the BAC clones in which the biallelic markers lie. The second column of FIG. 6A lists the internal identification number of the marker. The third column of FIG. 6A lists the sequence identification number for a first allele of the biallelic markers. The fourth column of FIG. 6A lists the sequence identification number for a second allele of the biallelic markers. For example, the first allele of the biallelic marker 99-123 has the sequence of SEQ ID NO: 21 and the second allele of the biallelic marker has the sequence of SEQ ID NO: 30.

[0223] The fifth column of FIG. 6A lists the sequences of upstream primers which may be used to generate amplification products containing the polymorphic bases of the biallelic markers. The sixth column of FIG. 6A lists the sequence identification numbers for the upstream primers.

[0224] The seventh column of FIG. 6A lists the sequences of downstream primers which may be used to generate amplification products containing the polymorphic bases of the biallelic markers. The eighth column of FIG. 6A lists the sequence identification numbers for the downstream primers.

[0225] The ninth column of FIG. 6A lists the position of the polymorphic base in the amplification products generated using the upstream and downstream primers. The tenth column lists the identities of the polymorphic bases found at the polymorphic positions in the biallelic markers. The eleventh and twelfth columns list the locations of microsequencing primers in the biallelic markers which can be used to determine the identities of the polymorphic bases.

[0226] In addition to the biallelic markers of SEQ ID NOs: 21-38, other biallelic markers (designated 99-1482, 4-73, 4-65) have been identified which are closely linked to one or more of the biallelic markers of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62, and the PG1 gene. These biallelic markers include the markers of SEQ ID NOs: 57-62, which are listed in FIG. 6B. The columns in FIG. 6B are identical to the corresponding columns in FIG. 6A. SEQ ID NOs: 58, 59, 61, and 62 lie within the PG1 gene of SEQ ID NO: 1 at the positions indicated in the accompanying Sequence Listing.

[0227] Genetic analysis of these additional biallelic markers is performed as follows. Nucleic acid samples are obtained from individuals suffering from prostate cancer and unaffected individuals. The frequencies at which each of the two alleles occurs in the affected and unaffected populations are determined using the methodologies described above. Association values are calculated to determine the correlation between the presence of a particular allele or spectrum of alleles and prostate cancer. The markers of SEQ ID NOs: 21-38 may also be included in the analysis used to calculate the risk factors. The markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 may be used in diagnostic techniques, such as those described below, to determine whether an individual is at risk for developing prostate cancer or suffers from prostate cancer as a result of a mutation in the PG1 gene.
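
The text does not state which statistic is used for the association values. As one possibility, a Pearson chi-square test on a 2x2 table of allele counts (affected versus unaffected, allele 1 versus allele 2) could be sketched as follows. The function names and the counts are hypothetical, and the p-value formula applies to one degree of freedom only.

```python
import math

def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Rows: affected, unaffected. Columns: allele 1 count, allele 2 count.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator

def p_value_1df(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Invented allele counts: 60/40 in affected chromosomes, 40/60 in unaffected.
chi2 = chi_square_2x2(60, 40, 40, 60)
print(chi2, p_value_1df(chi2))
```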

[0228] Example 6 describes methods for determining the biallelic marker pattern.

EXAMPLE 6

[0229] A nucleic acid sample is obtained from an individual to be tested for susceptibility to prostate cancer or PG1 mediated prostate cancer. The nucleic acid sample is an RNA sample or a DNA sample.

[0230] A PCR amplification is conducted using primer pairs which generate amplification products containing the polymorphic nucleotides of one or more biallelic markers associated with prostate cancer-related forms of PG1, such as the biallelic markers of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62, biallelic markers which are in linkage disequilibrium with the biallelic markers of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62, biallelic markers in linkage disequilibrium with the PG1 gene, or combinations thereof. In some embodiments, the PCR amplification is conducted using primer pairs which generate amplification products containing the polymorphic nucleotides of several biallelic markers. For example, in one embodiment, amplification products containing the polymorphic bases of several biallelic markers selected from the group consisting of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62, and biallelic markers which are in linkage disequilibrium with the biallelic markers of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62 or with the PG1 gene are generated. In another embodiment, amplification products containing the polymorphic bases of two or more biallelic markers selected from the group consisting of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62, and biallelic markers which are in linkage disequilibrium with the biallelic markers of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62 or with the PG1 gene are generated. In another embodiment, amplification products containing the polymorphic bases of five or more biallelic markers selected from the group consisting of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62, and biallelic markers which are in linkage disequilibrium with the biallelic markers of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62 or with the PG1 gene are generated. In another embodiment, amplification products containing the polymorphic bases of more than five of the biallelic markers selected from the group consisting of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62, and biallelic markers which are in linkage disequilibrium with the biallelic markers of SEQ ID NOs: 21-38, SEQ ID NOs: 57-62 or with the PG1 gene are generated.

[0231] For example, the primers used to generate the amplification products may comprise the primers listed in FIG. 6A or 6B (SEQ ID NOs: 39-56 and SEQ ID NOs: 63-68). FIGS. 6A and 6B provide exemplary primers which may be used in the amplification reactions and the identities and locations of the polymorphic bases in the amplification products which are produced with the exemplary primers. The sequences of each of the alleles of the biallelic markers resulting from amplification using the primers in FIGS. 6A and 6B are listed in the accompanying Sequence Listing as SEQ ID NOs: 21-38 and 57-62.

[0232] The PCR primers may be oligonucleotides of 10, 15, 20 or more bases in length which enable the amplification of the polymorphic site in the markers. In some embodiments, the amplification product produced using these primers is at least 100 bases in length (i.e. 50 nucleotides on each side of the polymorphic base). In other embodiments, the amplification product produced using these primers is at least 500 bases in length (i.e. 250 nucleotides on each side of the polymorphic base). In still further embodiments, the amplification product produced using these primers is at least 1000 bases in length (i.e. 500 nucleotides on each side of the polymorphic base).

[0233] It will be appreciated that the primers listed in FIGS. 6A and 6B are merely exemplary and that any other set of primers which produce amplification products containing the polymorphic nucleotides of one or more of the biallelic markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or biallelic markers in linkage disequilibrium with the sequences of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or with the PG1 gene, or a combination thereof, may be used in the diagnostic methods.

[0234] Following the PCR amplification, the identities of the polymorphic bases of one or more of the biallelic markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62, or biallelic markers in linkage disequilibrium with the sequences of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or with the PG1 gene, or a combination thereof, are determined. The identities of the polymorphic bases may be determined using the microsequencing procedures described in Example 5 above and the microsequencing primers listed as features in the sequences of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62. It will be appreciated that the microsequencing primers listed as features in the sequences of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 are merely exemplary and that any primer having a 3′ end near the polymorphic nucleotide, and preferably immediately adjacent to the polymorphic nucleotide, may be used. Alternatively, the microsequencing analysis may be performed as described in Pastinen et al., Genome Research 7:606-614 (1997), which is described in more detail below.

[0235] Alternatively, the PCR product may be completely sequenced to determine the identities of the polymorphic bases in the biallelic markers. In another method, the identities of the polymorphic bases in the biallelic markers are determined by hybridizing the amplification products to microarrays containing allele specific oligonucleotides specific for the polymorphic bases in the biallelic markers. The use of microarrays comprising allele specific oligonucleotides is described in more detail below.

[0236] It will be appreciated that the identities of the polymorphic bases in the biallelic markers may be determined using techniques other than those listed above, such as conventional dot blot analyses.
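
A microsequencing primer whose 3′ end lies immediately adjacent to the polymorphic nucleotide can be read directly off the amplicon sequence. The sketch below is illustrative only: the function name, the 1-based position convention and the toy sequence are assumptions, and the default length of 23 follows the 23mers mentioned in Example 5.

```python
def microsequencing_primer(amplicon: str, polymorphic_pos: int, length: int = 23) -> str:
    """Return the `length` bases immediately 5' of the polymorphic base.

    polymorphic_pos is 1-based. The last base of the returned primer is the
    base directly upstream of the polymorphic position on the given strand.
    """
    start = polymorphic_pos - 1 - length
    if start < 0 or polymorphic_pos > len(amplicon):
        raise ValueError("not enough upstream sequence for a primer of this length")
    return amplicon[start:polymorphic_pos - 1]

# Toy amplicon (invented); polymorphic base at position 30, 5-base primer.
toy = "ACGT" * 10
print(microsequencing_primer(toy, 30, length=5))  # ACGTA
```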

[0237] Nucleic acids used in the above diagnostic procedures may comprise at least 10 consecutive nucleotides in the biallelic markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or the sequences complementary thereto. Alternatively, the nucleic acids used in the above diagnostic procedures may comprise at least 15 consecutive nucleotides in the biallelic markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or the sequences complementary thereto. In some embodiments, the nucleic acids used in the above diagnostic procedures may comprise at least 20 consecutive nucleotides in the biallelic markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or the sequences complementary thereto. In still other embodiments, the nucleic acids used in the above diagnostic procedures may comprise at least 30 consecutive nucleotides in the biallelic markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or the sequences complementary thereto. In further embodiments, the nucleic acids used in the above diagnostic procedures may comprise more than 30 consecutive nucleotides in the biallelic markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or the sequences complementary thereto. In still further embodiments, the nucleic acids used in the above diagnostic procedures may comprise the entire sequence of the biallelic markers of SEQ ID NOs: 21-38 and SEQ ID NOs: 57-62 or the sequences complementary thereto.

I.E. Identification and Sequencing of the PG1 Gene, and Localization of the PG1 Protein

[0238] The above haplotype analysis indicated that 171 kb of genomic DNA between biallelic markers 4-14 and 99-221 totally or partially contains a gene responsible for prostate cancer. Therefore, the protein coding sequences lying within this region were characterized to locate the gene associated with prostate cancer. This analysis, described in further detail below, revealed a single protein coding sequence in the 171 kb, which was designated as the PG1 gene.

[0239] Template DNA for sequencing the PG1 gene was obtained as follows. BACs 189EO8 and 463FO1 were subcloned as previously described. Plasmid inserts were first amplified by PCR on PE 9600 thermocyclers (Perkin-Elmer), using appropriate primers, AmpliTaqGold (Perkin-Elmer), dNTPs (Boehringer), buffer and cycling conditions as recommended by the Perkin-Elmer Corporation.

[0240] PCR products were then sequenced using automatic ABI Prism 377 sequencers (Perkin Elmer, Applied Biosystems Division, Foster City, Calif.). Sequencing reactions were performed using PE 9600 thermocyclers (Perkin Elmer) with standard dye-primer chemistry and ThermoSequenase (Amersham Life Science). The primers were labeled with the JOE, FAM, ROX and TAMRA dyes. The dNTPs and ddNTPs used in the sequencing reactions were purchased from Boehringer. Sequencing buffer, reagent concentrations and cycling conditions were as recommended by Amersham.

[0241] Following the sequencing reaction, the samples were precipitated with EtOH, resuspended in formamide loading buffer, and loaded on a standard 4% acrylamide gel. Electrophoresis was performed for 2.5 hours at 3000 V on an ABI 377 sequencer, and the sequence data were collected and analyzed using the ABI Prism DNA Sequencing Analysis Software, version 2.1.2.

[0242] The sequence data obtained as described above were transferred to a proprietary database, where quality control and validation steps were performed. A proprietary base-caller (“Trace”), running on a Unix system, automatically flagged suspect peaks, taking into account the shape of the peaks, the inter-peak resolution, and the noise level. The proprietary base-caller also performed an automatic trimming. Any stretch of 25 or fewer bases having more than 4 suspect peaks was considered unreliable and was discarded. Sequences corresponding to cloning vector oligonucleotides were automatically removed from the sequence. However, the resulting sequence may contain 1 to 5 bases belonging to the vector sequences at their 5′ end. If needed, these can easily be removed on a case by case basis.

[0243] The genomic sequence of the PG1 gene is provided in the accompanying Sequence Listing and is designated as SEQ ID NO: 1.
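
The proprietary base-caller is not described in detail, so the following is only one possible reading of the trimming rule: slide a 25-base window along the read and mark every base in any window that contains more than 4 suspect peaks. The sliding-window interpretation, the function name and the input encoding are all assumptions of this sketch.

```python
def unreliable_positions(suspect, window: int = 25, max_suspect: int = 4):
    """Indices (0-based) of bases lying in a window with too many suspect peaks.

    suspect: sequence of booleans, one per called base, True for a flagged peak.
    A window is unreliable when it holds more than max_suspect flagged peaks.
    """
    bad = set()
    n = len(suspect)
    for start in range(max(n - window + 1, 1)):
        chunk = suspect[start:start + window]
        if sum(chunk) > max_suspect:
            bad.update(range(start, start + len(chunk)))
    return bad

# Invented read: five suspect peaks at the start, then 45 clean bases.
flags = [True] * 5 + [False] * 45
print(sorted(unreliable_positions(flags)))  # positions 0 to 24
```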

[0244] Potential exons in BAC-derived human genomic sequences were located by homology searches on protein, nucleic acid and EST (Expressed Sequence Tags) public databases. Main public databases were locally reconstructed. The protein database, NRPU (Non-redundant Protein Unique), is formed by a non-redundant fusion of the Genpept (Benson D. A. et al., Nucleic Acids Res. 24: 1-5 (1996)), Swissprot (Bairoch, A. and Apweiler, R., Nucleic Acids Res. 24: 21-25 (1996)) and PIR/NBRF (George, D. G. et al., Nucleic Acids Res. 24: 17-20 (1996)) databases. Redundant data were eliminated by using the NRDB software (Benson et al., supra) and internal repeats were masked with the XNU software (Benson et al., supra). Homologies found using the NRPU database allowed the identification of sequences corresponding to potential coding exons related to known proteins.

[0245] The EST local database is composed of the gbest section (1-9) of GenBank (Benson et al., supra), and thus contains all publicly available transcript fragments. Homologies found with this database allowed the localization of potentially transcribed regions.

[0246] The local nucleic acid database contained all sections of GenBank and EMBL (Rodriguez-Tome, P. et al., Nucleic Acids Res. 24: 6-12 (1996)) except the EST sections. Redundant data were eliminated as previously described.

[0247] Similarity searches in protein or nucleic acid databases were performed using the BLAST software (Altschul, S. F. et al., J. Mol. Biol. 215: 403-410 (1990)). Alignments were refined using the Fasta software, and multiple alignments used Clustal W. Homology thresholds were adjusted for each analysis based on the length and the complexity of the tested region, as well as on the size of the reference database.

[0248] Potential exon sequences identified as above were used as probes to screen cDNA libraries. Extremities of positive clones were sequenced and the sequence stretches were positioned on the genomic sequence of SEQ ID NO: 1. Primers were then designed using the results from these alignments in order to enable the PG1 cloning procedure described below.

Cloning PG1 cDNA

[0249] PG1 cDNA was obtained as follows. 4 μl of ethanol suspension containing 1 mg of human prostate total RNA (Clontech laboratories, Inc., Palo Alto, USA; catalogue N. 64038-1, lot 7040869) was centrifuged, and the resulting pellet was air dried for 30 minutes at room temperature.

[0250] First strand cDNA synthesis was performed using the Advantage™ RT-for-PCR kit (Clontech laboratories, Inc., Palo Alto, USA; catalogue N. K1402-1). 1 μl of 20 mM solution of primer PGRT32: TTTTTTTTTTTTTTTTTTTGAAAT (SEQ ID NO: 10) was added to 12.5 μl of RNA solution in water, heated at 74° C. for two and a half minutes and rapidly quenched in an ice bath. 10 μl of 5× RT buffer (50 mM Tris-HCl, pH 8.3, 75 mM KCl, 3 mM MgCl2), 2.5 μl of dNTP mix (10 mM each), and 1.25 μl of human recombinant placental RNA inhibitor were mixed with 1 ml of MMLV reverse transcriptase (200 units). 6.5 μl of this solution were added to RNA-primer mix and incubated at 42° C. for one hour. 80 μl of water were added and the solution was incubated at 94° C. for 5 minutes.

[0251] 5 μl of the resulting solution were used in a Long Range PCR reaction with hot start, in 50 μl final volume, using 2 units of rtTHXL, 20 pmol/μl of each of GC1.5p.1: CTGTCCCTGGTGCTCCACACGTACTC (SEQ ID NO: 6) or GC1.5p2: TGGTGCTCCACACGTACTCCATGCGC (SEQ ID NO: 7) and GC1.3p: CTTGCCTGCTGGAGACACAGAATTTCGATAGCAC (SEQ ID NO: 9) primers with 35

[0252] cycles of elongation for 6 minutes at 67° C. in a thermocycler.

[0253] The sequence of the PG1 cDNA obtained as described above (SEQ ID NO: 3) is provided in the accompanying Sequence Listing. Results of Northern blot analysis of prostate mRNAs support the existence of a major PG1 cDNA having a 5-6 kb length.

Characterization of the PG1 Gene

[0254] The intron/exon structure of the gene was deduced by aligning the mRNA sequence from the cDNA of SEQ ID NO: 3 and the genomic DNA sequence of SEQ ID NO: 1.
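
Once exon boundaries have been placed on the genomic sequence by such an alignment, intron intervals and splice-site phases follow by simple arithmetic. The following is a minimal sketch only: the exon coordinates and the position of the start codon are hypothetical placeholders, not the positions listed in FIG. 7, and the helper names are illustrative.

```python
# Hypothetical exon coordinates (1-based, inclusive) on a genomic sequence.
# These are placeholders for illustration, not the positions of FIG. 7.
exons = [(100, 250), (900, 1010), (1500, 1720)]


def introns_from_exons(exons):
    """Return the intron intervals lying between consecutive exons."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def splice_phases(exons, cds_start):
    """Phase of each intron: 0 if it falls between two codons, 1 or 2 if it
    interrupts a codon after its first or second base. Introns located
    upstream of the start codon are reported as phase 0."""
    phases = []
    coding_length = 0
    for start, end in sorted(exons)[:-1]:
        first_coding = max(start, cds_start)
        if end >= first_coding:
            coding_length += end - first_coding + 1
        phases.append(coding_length % 3)
    return phases


introns = introns_from_exons(exons)
phases = splice_phases(exons, cds_start=120)  # assumed start codon position
print(introns, phases)
```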

[0255] The positions of the introns and exons in the PG1 genomic DNA are provided in FIGS. 7 and 8. FIG. 7 lists positions of the start and end nucleotides defining each of the at least 8 exons (labeled Exons A-H) in the sequence of SEQ ID NO: 1, the locations and phases of the 5′ and 3′ splice sites in the sequence of SEQ ID NO: 1, the position of the stop codon in the sequence of SEQ ID NO: 1, and the position of the polyadenylation site in the sequence of SEQ ID NO: 1. FIG. 8 shows the positions of the exons within the PG1 genomic DNA and the PG1 mRNA, the location of a tyrosine phosphatase retropseudogene in the PG1 genomic DNA, the positions of the coding region in the mRNA, and the locations of the polyadenylation signal and polyA stretch in the mRNA.

[0256] As indicated in FIGS. 7 and 8, the PG1 gene comprises at least 8 exons, and spans more than 52 kb. The first intron contains a tyrosine phosphatase retropseudogene. A G/C rich putative promoter region lies between nucleotides 1629 and 1870 of SEQ ID NO: 1. A CCAAT box is present at nucleotide 1661 of SEQ ID NO: 1. The promoter region was identified as described in Prestridge, D. S., Predicting Pol II Promoter Sequences Using Transcription Factor Binding Sites, J. Mol. Biol. 249:923-932 (1995).
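
The two features named above, G/C richness and the presence of a CCAAT box, can be checked on any candidate region with a few lines of code. The sketch below is illustrative only: it is not the Prestridge method, and the toy sequence stands in for the region of SEQ ID NO: 1, which is not reproduced here.

```python
def gc_fraction(seq):
    """Fraction of G and C residues in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_motif(seq, motif="CCAAT"):
    """1-based start positions of every occurrence of the motif."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    i = seq.find(motif)
    while i != -1:
        hits.append(i + 1)
        i = seq.find(motif, i + 1)
    return hits


# Toy sequence (an assumption for illustration, not part of SEQ ID NO: 1).
toy_region = "GCGCGGCCAATGCGGCCGCATAT"
print(round(gc_fraction(toy_region), 3), find_motif(toy_region))
```

For a real analysis the region would be sliced out of the genomic sequence by its 1-based coordinates before calling these helpers.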

[0257] It is possible that the methionine listed as being the initiating methionine in the PG1 protein sequence of SEQ ID NO: 4 (based on the cDNA sequence of SEQ ID NO: 3) may actually be downstream of, but in phase with, another methionine which acts as the initiating methionine. The genomic DNA sequence of SEQ ID NO: 1 contains a methionine codon upstream from the methionine at position number 1 of the protein sequence of SEQ ID NO: 4. If the upstream methionine is in fact the authentic initiation site, the sequence of the PG1 protein would be that of SEQ ID NO: 5. This possibility is investigated by determining the exact position of the 5′ end of the PG1 mRNA as follows.
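
The notion of an upstream methionine "in phase" with the annotated one can be made concrete: walking upstream codon by codon from the annotated ATG, any ATG met before an in-frame stop codon is a candidate alternative initiation site. The following sketch assumes a contiguous (already spliced) sequence and uses a toy sequence; it does not reproduce the analysis of SEQ ID NO: 1.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_inframe_starts(seq, atg_index):
    """Return 0-based indices of ATG codons located upstream of, and in
    frame with, the ATG at atg_index. The scan stops at the first in-frame
    stop codon, since translation could not read through it."""
    seq = seq.upper()
    found = []
    i = atg_index - 3
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon == "ATG":
            found.append(i)
        i -= 3
    return found


# Toy sequence: TAA ATG GCC GCT ATG GGC, annotated start at index 12.
toy_transcript = "TAAATGGCCGCTATGGGC"
print(upstream_inframe_starts(toy_transcript, 12))
```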

[0258] One way to determine the exact position of the 5′ end of the PG1 mRNA is to perform a 5′ RACE reaction using the Marathon-Ready human prostate cDNA kit from Clontech (Catalog No. PT1156-1). For example, the RACE reaction may employ the PG1 primer PG15RACE196, having the sequence CAATATCTGGACCCCGGTGTAATTCTC (SEQ ID NO: 8), as the first primer. The second primer in the RACE reaction is PG15RACE130n, having the sequence GGTCGTCCAGCGCTTGGTAGAAG (SEQ ID NO: 2). The sequence analysis of the resulting PCR product, or the product obtained with other PG1 specific primers, will give the exact sequence of the initiation point of the PG1 transcript.

[0259] Alternatively, the 5′ sequence of the PG1 transcript can be determined by conducting a PCR amplification with a series of primers extending from the 5′ end of the presently identified coding region. In any event, the present invention contemplates use of PG1 nucleic acids and/or polypeptides coding for or corresponding to either SEQ ID NO: 4 or SEQ ID NO: 5 or fragments thereof.

[0260] It is also possible that alternative splicing of the PG1 gene may result in additional translation products not described above. It is also possible that there are sequences upstream or downstream of the genomic sequence of SEQ ID NO: 1 which contribute to the translation products of the gene. Finally, alternative promoters may result in PG1 derived transcripts other than those described herein.

[0261] The promoter activity of the region between nucleotides 1629 and 1870 can be verified as described below. Alternatively, should this region lack promoter activity, the promoter responsible for driving expression of the PG1 gene is identified as described below.

[0262] Genomic sequences lying upstream of the PG1 gene are cloned into a suitable promoter reporter vector, such as the pSEAP-Basic, pSEAP-Enhancer, pβgal-Basic, pβgal-Enhancer, or pEGFP-1 Promoter Reporter vectors available from Clontech. Briefly, each of these promoter reporter vectors includes multiple cloning sites positioned upstream of a reporter gene encoding a readily assayable protein such as secreted alkaline phosphatase, β galactosidase, or green fluorescent protein. The sequences upstream of the PG1 coding region are inserted into the cloning sites upstream of the reporter gene in both orientations and introduced into an appropriate host cell. The level of reporter protein is assayed and compared to the level obtained from a vector which lacks an insert in the cloning site. The presence of an elevated expression level in the vector containing the insert with respect to the control vector indicates the presence of a promoter in the insert. If necessary, the upstream sequences can be cloned into vectors which contain an enhancer for augmenting transcription levels from weak promoter sequences. A significant level of expression above that observed with the vector lacking an insert indicates that a promoter sequence is present in the inserted upstream sequence.
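
The comparison against the insert-free vector amounts to a fold-over-control calculation. The sketch below illustrates it with made-up reporter readings; the readings, the two-fold cut-off and the function names are assumptions chosen for illustration, since the text does not specify what level counts as significant.

```python
from statistics import mean


def fold_over_control(insert_readings, control_readings):
    """Mean reporter signal of the insert-bearing vector divided by the
    mean signal of the vector lacking an insert."""
    control = mean(control_readings)
    if control <= 0:
        raise ValueError("control signal must be positive")
    return mean(insert_readings) / control


def has_promoter_activity(insert_readings, control_readings, min_fold=2.0):
    """True when the signal exceeds the control by the assumed cut-off."""
    return fold_over_control(insert_readings, control_readings) >= min_fold


# Hypothetical replicate readings (arbitrary units).
forward_insert = [820, 790, 845]   # upstream sequence, sense orientation
reverse_insert = [110, 95, 105]    # same sequence, opposite orientation
empty_vector = [100, 98, 102]      # vector lacking an insert

print(round(fold_over_control(forward_insert, empty_vector), 2))
print(has_promoter_activity(reverse_insert, empty_vector))
```

In practice a statistical test across replicates would replace the fixed cut-off used here.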

[0263] Promoter sequences within the upstream genomic DNA are further defined by constructing nested deletions in the upstream DNA using conventional techniques such as Exonuclease III digestion. The resulting deletion fragments can be inserted into the promoter reporter vector to determine whether the deletion has reduced or obliterated promoter activity. In this way, the boundaries of the promoters are defined. If desired, potential individual regulatory sites within the promoter are identified using site directed mutagenesis or linker scanning to obliterate potential transcription factor binding sites within the promoter individually or in combination. The effects of these mutations on transcription levels are determined by inserting the mutations into the cloning sites in the promoter reporter vectors.
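
The logic of boundary mapping by nested deletions can be sketched in silico: each successive fragment lacks more of the 5′ end, and the promoter boundary lies between the last deletion that retains activity and the first that loses it. The sequence, step size and activity calls below are hypothetical.

```python
def nested_5prime_deletions(seq, step):
    """Return (offset, fragment) pairs, each fragment lacking `step` more
    bases from the 5' end than the previous one."""
    return [(i, seq[i:]) for i in range(0, len(seq), step)]


def promoter_boundary(offsets, active):
    """Return the (last_active_offset, first_inactive_offset) interval in
    which an essential 5' element must begin, or None if every deletion
    retains activity. `active` holds one boolean per offset."""
    for previous, current, is_active in zip(offsets, offsets[1:], active[1:]):
        if not is_active:
            return (previous, current)
    return None


deletions = nested_5prime_deletions("ACGTACGTAC", step=4)  # toy sequence
offsets = [offset for offset, _ in deletions]
# Hypothetical reporter results for the three fragments.
boundary = promoter_boundary(offsets, active=[True, True, False])
print(deletions, boundary)
```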

[0264] Sequences within the PG1 promoter region which are likely to bind transcription factors are identified by homology to known transcription factor binding sites or through conventional mutagenesis or deletion analyses of reporter plasmids containing the promoter sequence. For example, deletions are made in a reporter plasmid containing the promoter sequence of interest operably linked to an assayable reporter gene. The reporter plasmids carrying various deletions within the promoter region are transfected into an appropriate host cell and the effects of the deletions on expression levels are assessed. Transcription factor binding sites within the regions in which deletions reduce expression levels are further localized using site directed mutagenesis, linker scanning analysis, or other techniques familiar to those skilled in the art.

[0265] The promoters and other regulatory sequences located upstream of the PG1 gene are used to design expression vectors capable of directing the expression of an inserted gene in a desired spatial, temporal, developmental, or quantitative manner. For example, since the PG1 promoter is presumably active in the prostate, it can be used to construct expression vectors for directing gene expression in the prostate.

[0266] Preferably, in such expression vectors, the PG1 promoter is placed near multiple restriction sites to facilitate the cloning of an insert encoding a protein for which expression is desired downstream of the promoter, such that the promoter is able to drive expression of the inserted gene. The promoter is inserted in conventional nucleic acid backbones designed for extrachromosomal replication, integration into the host chromosomes or transient expression. Suitable backbones for the present expression vectors include retroviral backbones, backbones from eukaryotic episomes such as SV40 or Bovine Papilloma Virus, backbones from bacterial episomes, or artificial chromosomes.

[0267] Preferably, the expression vectors also include a polyA signal downstream of the multiple restriction sites for directing the polyadenylation of mRNA transcribed from the gene inserted into the expression vector.

[0268] Nucleic acids encoding proteins which interact with sequences in the PG1 promoter are identified using one-hybrid systems such as those described in the manual accompanying the Matchmaker One-Hybrid System kit available from Clontech (Catalog No. K1603-1). Briefly, the Matchmaker One-Hybrid System is used as follows. The target sequence for which it is desired to identify binding proteins is cloned upstream of a selectable reporter gene and integrated into the yeast genome. Preferably, multiple copies of the target sequences are inserted into the reporter plasmid in tandem.

[0269] A library comprised of fusions between cDNAs to be evaluated for the ability to bind to the promoter and the activation domain of a yeast transcription factor, such as GAL4, is transformed into the yeast strain containing the integrated reporter sequence. The yeast are plated on selective media to select cells expressing the selectable marker linked to the promoter sequence. The colonies which grow on the selective media contain genes encoding proteins which bind the target sequence. The inserts in the genes encoding the fusion proteins are further characterized by sequencing. In addition, the inserts are inserted into expression vectors or in vitro transcription vectors. Binding of the polypeptides encoded by the inserts to the promoter DNA is confirmed by techniques familiar to those skilled in the art, such as gel shift analysis or DNAse protection analysis.

Analysis of PG1 Protein Sequence

[0270] The PG1 cDNA of SEQ ID NO: 3 encodes a 353 amino-acid protein (SEQ ID NO: 4). As indicated in the accompanying Sequence Listing, a Prosite analysis indicated that the PG1 protein has a leucine zipper motif, a potential glycosylation site, 3 potential casein kinase II phosphorylation sites, a potential cAMP dependent protein kinase phosphorylation site, 2 potential tyrosine kinase phosphorylation sites, 4 potential protein kinase C phosphorylation sites, 5 potential N-myristoylation sites, 1 potential tyrosine sulfation site, and one potential amidation site.
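
Prosite-style motif searches of this kind reduce to pattern matching along the protein sequence. The sketch below converts two well-known PROSITE patterns (the leucine zipper pattern L-x(6)-L-x(6)-L-x(6)-L and the N-glycosylation pattern N-{P}-[ST]-{P}) into regular expressions and scans a toy peptide. It handles only the simple pattern elements shown and is not a substitute for the Prosite software; the peptide is invented and is not SEQ ID NO: 4.

```python
import re

LEUCINE_ZIPPER = "L-x(6)-L-x(6)-L-x(6)-L"
N_GLYCOSYLATION = "N-{P}-[ST]-{P}"


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex. Supports
    single residues, x, [allowed], {forbidden} and (n) or (n,m) repeats."""
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def scan(sequence, pattern):
    """1-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() + 1 for m in regex.finditer(sequence)]


# Invented peptide containing one instance of each motif.
toy_peptide = "MNGSA" + "LAAAAAA" * 3 + "L"
print(scan(toy_peptide, N_GLYCOSYLATION), scan(toy_peptide, LEUCINE_ZIPPER))
```

Short patterns such as these match frequently by chance, which is why the sites reported above are described as potential.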

[0271] A search for membrane associated domains was conducted according to the methods described in Argos, P. et al., Structural Prediction of Membrane-bound Proteins, Eur. J. Biochem. 128:565-575 (1982); Klein et al., Biochimica & Biophysica Acta 815:468-476 (1985); and Eisenberg et al., J. Mol. Biol. 179:125-142 (1984). The search revealed 5 potential transmembrane domains predicted to be integral membrane domains. These results suggest that the PG1 protein is likely to be membrane-associated and is an integral membrane protein.
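
The general principle behind such searches is a sliding-window hydrophobicity average. The sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cut-off of 1.6, which is a different and simpler technique than the Argos, Klein and Eisenberg methods cited above; it is given only to illustrate the idea, and the test peptide is invented.

```python
# Kyte-Doolittle hydropathy values (swapped in for illustration; not the
# scales used by the methods cited in the text).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window of the given length."""
    scores = [KD[residue] for residue in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Merge overlapping or adjacent windows whose mean hydropathy reaches
    the threshold into 1-based (start, end) spans."""
    spans = []
    for i, score in enumerate(hydropathy_profile(seq, window)):
        if score >= threshold:
            start, end = i + 1, i + window
            if spans and start <= spans[-1][1] + 1:
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


# Invented peptide: a 20-leucine stretch flanked by charged residues.
toy_protein = "DKDKDKDKDK" + "L" * 20 + "DKDKDKDKDK"
tm_spans = candidate_tm_segments(toy_protein)
print(tm_spans)
```

The reported span is wider than the hydrophobic core because windows that only partly overlap the core can still exceed the cut-off.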

[0272] A homology search was conducted to identify proteins homologous to the PG1 protein. Several proteins were identified which share homology with the PG1 protein. FIG. 9 lists the accession numbers of several proteins which share homology with the PG1 protein in three regions designated box1, box2 and box3.

[0273] It will be appreciated that each of the motifs described above is also present in the protein of SEQ ID NO: 5, which would be produced if translation initiated from the potential upstream methionine in the nucleic acid of SEQ ID NO: 1.

[0274] As indicated in FIG. 9, a distinctive pattern of homology to box1, box2 (SEQ ID NOs: 11-14) and box3 (SEQ ID NOs: 15-20) is found amongst acyl glycerol transferases. For example, the plsC protein from E. coli (Accession Number P26647) shares homology with the box1 and box2 sequences, but not the box3 sequence, of the PG1 protein. The product of this gene transfers acyl from acyl-coenzyme A to the sn2 position of 1-acyl-sn-glycerol-3-phosphate (lysophosphatidic acid, LPA) (Coleman J., Mol Gen Genet. Mar. 1, 1992; 232(2): 295-303).

[0275] Box1 and box2 homologies, but not box3 homologies, are also found in the SLC1 gene product from baker's yeast (Accession Number P33333) and the mouse gene AB005623. Each of these genes is able to complement in vivo mutations in the bacterial plsC gene (Nagiec M M, Wells G B, Lester R L, Dickson R C, J. Biol. Chem., Oct. 15, 1993; 268(29): 22156-22163, A suppressor gene that enables Saccharomyces cerevisiae to grow without making sphingolipids encodes a protein that resembles an Escherichia coli fatty acyltransferase; and Kume K, Shimizu T, Biochem. Biophys. Res. Commun. Aug. 28, 1997; 237(3): 663-666, cDNA cloning and expression of murine 1-acyl-sn-glycerol-3-phosphate acyltransferase).

[0276] Recently two different human homologues of the mouse AB005623 gene, Accession Numbers U89336 and U56417, were cloned and found to be localized to human chromosomes 6 and 9 (Eberhardt, C., Gray, P. W. and Tjoelker, L. W., J. Biol. Chem. 1997; 272, 20299-20305, Human lysophosphatidic acid acyltransferase cDNA cloning, expression, and localization to chromosome 9q34.3; and West, J., Tompkins, C. K., Balantac, N., Nudelman, E., Meengs, B., White, T., Bursten, S., Coleman, J., Kumar, A., Singer, J. W. and Leung, D. W., DNA Cell Biol. 6, 691-701 (1997), Cloning and expression of two human lysophosphatidic acid acyltransferase cDNAs that enhance cytokine induced signaling responses in cells).

[0277] The enzymatic acylation of LPA results in 1,2-diacyl-sn-glycerol 3-phosphate, an intermediate in the biosynthesis of both glycerophospholipids and triacylglycerol. Several important signaling messengers participating in the transduction of mitogenic signals, induction of apoptosis, transmission of nerve impulses and other cellular responses mediated by membrane bound receptors belong to this metabolic pathway.

[0278] LPA itself is a potent regulator of mammalian cell proliferation. In fact, LPA is one of the major mitogens found in blood serum (for a review: Durieux M E, Lynch K R, Trends Pharmacol. Sci. June 1993; 14(6): 249-254, Signaling properties of lysophosphatidic acid). LPA can act as a survival factor to inhibit apoptosis of primary cells (Levine J S, Koh J S, Triaca V, Lieberthal W, Am. J. Physiol. October 1997; 273(4Pt2): F575-F585, Lysophosphatidic acid: a novel growth and survival factor for renal proximal tubular cells). This function of LPA is mediated by the lipid kinase phosphatidylinositol 3-kinase.

[0279] Phosphatidylinositol and its derivatives present another class of messengers emerging from the 1-acyl-sn-glycerol-3-phosphate acyltransferase pathway (Toker A, Cantley L C, Nature Jun. 12, 1997; 387(6634): 673-676, Signaling through the lipid products of phosphoinositide-3-OH kinase; Martin T F, Curr. Opin. Neurobiol. June 1997; 7(3):331-338, Phosphoinositides as spatial regulators of membrane traffic; and Hsuan J J, et al., Int. J. Biochem. Cell Biol. Mar. 1, 1997; 29(3): 415-435, Growth factor-dependent phosphoinositide signaling).

[0280] Cell growth, differentiation and apoptosis can be affected and modified by enzymes involved in this metabolic pathway. Consequently, alteration of this pathway could facilitate cancer cell progression. Modulation of the activity of enzymes in this pathway using agents such as enzymatic inhibitors could be a way to restore a normal phenotype to cancerous cells.

[0281] Ashagbley A, Samadder P, Bittman R, Erukulla R K, Byun H S and Arthur G have recently shown that an ether-linked analogue of lysophosphatidic acid, 4-O-hexadecyl-3(S)-O-methoxybutanephosphonate, can effectively inhibit the proliferation of several human cancerous cell lines, including the DU145 line of prostate cancer origin (Anticancer Res July 1996; 16(4A): 1813-1818, Synthesis of ether-linked analogues of lysophosphatidate and their effect on the proliferation of human epithelial cancer cells in vitro).

[0282] Structural differences between the PG1 family of cellular proteins and the functionally confirmed 1-acyl-sn-glycerol-3-phosphate acyltransferase family, evidenced by the existence of a different pattern of homology to box3, could point to unique substrate specificity in the phospholipid metabolic pathway, to specific interaction with other cellular components or to both.

[0283] Further analysis of the function of the PG1 gene can be conducted, for example, by constructing knockout mutations in the yeast homologues of the PG1 gene in order to elucidate the potential function of this protein family, and to test potential substrate analogs in order to revert the malignant phenotype of human prostate cancer cells as described in Section VIII, below.

EXAMPLE 7 Analysis of the Intracellular Localisation of the PG1 Isoforms

[0284] To study the intracellular localisation of PG1 protein, different isoforms of PG1 were cloned in the expression vector pEGFP-N1 (Clontech), transfected and expressed in normal (PNT2A) or adenocarcinoma (PC3) prostatic cell lines.

[0285] First, to generate cDNA inserts, 5′ and 3′ primers were synthesised to allow amplification of different regions of the PG1 open reading frame. These primers were designed with an internal EcoRI or BamHI site, respectively, which allowed the insertion of the amplified product into the EcoRI and BamHI sites of the expression vector. The restriction sites were introduced into the primers so that, after cloning into pEGFP-N1, the PG1 open reading frame would be fused in frame to the EGFP open reading frame. The translated protein would be a fusion between PG1 and EGFP. Since EGFP is a variant form of GFP (Green Fluorescent Protein), it is possible to detect the intracellular localisation of the different PG1 isoforms by examining the fluorescence emitted by the EGFP fusion protein.

[0286] The different forms that were analysed correspond either to different messengers identified by RT-PCR performed on total normal human prostatic RNA or to a truncated form resulting from a nonsense mutation identified in the tumoural prostatic cell line LNCaP. The different PG1 constructions were transfected using the lipofectin technique and EGFP expression was examined 20 hours post transfection.

[0287] The names and descriptions of the different forms transfected are listed below:

[0288] A) PG1 includes all the coding exons from exon 1 to 8.

[0289] B) PG1/1-4 corresponds to an alternative messenger which is due to an alternative splicing, joining exon 1 to exon 4, and resulting in the absence of exons 2 and 3.

[0290] C) PG1/1-5 corresponds to an alternative messenger which is due to an alternative splicing, joining exon 1 to exon 5, and resulting in the absence of exons 2, 3 and 4.

[0291] D) PG1/1-7 includes exons 1 to 6, and corresponds to the mutated form identified in genomic DNA of the prostatic tumoural cell line LNCaP.
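
The four forms listed above can be represented as ordered exon lists, from which the missing exons and the exon-exon junctions of each form follow directly. The sketch below assumes that the spliced forms retain every exon downstream of the alternative junction, which the text implies but does not state explicitly; the dictionary and function names are illustrative.

```python
ALL_EXONS = list(range(1, 9))  # coding exons 1 to 8

# Exon composition of each transfected form. Retention of all exons
# downstream of the alternative junction is an assumption.
ISOFORMS = {
    "PG1": list(range(1, 9)),
    "PG1/1-4": [1] + list(range(4, 9)),
    "PG1/1-5": [1] + list(range(5, 9)),
    "PG1/1-7": list(range(1, 7)),
}


def missing_exons(exons):
    """Exons of the full-length form that are absent from this form."""
    return [exon for exon in ALL_EXONS if exon not in exons]


def junctions(exons):
    """Exon-exon junctions present in the messenger, as (5', 3') pairs."""
    return list(zip(exons, exons[1:]))


for name, exons in ISOFORMS.items():
    print(name, missing_exons(exons), junctions(exons)[0])
```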

Cloning of the PG1 cDNA inserts in the EGFP-N1 expression vector

[0292] cDNAs from human prostate were obtained by RT-PCR using the Advantage RT-for-PCR Kit (CLONTECH ref K1402-2). First, 1 μl of oligodT-containing PG1 specific primer PGRT32 TTTTTTTTTTTTTTTTTTTGAAAT (20 pmoles) and 11.5 μl of DEPC treated H₂O were added to 1 μl of total mRNA (1 μg) extracted from human prostate (CLONTECH ref 64038-1). The mRNA was heat denatured for 2.5 min. at 74° C. and then quickly chilled on ice. A mix containing 4 μl of 5× buffer, 1 μl of dNTPs (10 mM each), 0.5 μl of recombinant RNase inhibitor (20 U) and 1 μl of MoMuLV Reverse Transcriptase (200 U) was added to the denatured mRNA. Reverse transcription was performed for 60 min. at 42° C. Enzymes were heat denatured for 5 min. at 94° C. Then, 80 μl of DEPC treated H₂O were added to the reaction mix and the cDNA mix was stored at −20° C. Primers PG15Eco3 (5′ CCTGAATTCCGCCGAGCTGAGAAGATGC 3′), and PG13Bam2 (5′ CCTGGATCCGCTTTAATAGTAACCCACAGGCAG 3′)

[0293] were used for PCR amplification of the different PG1 cDNAs. A 50 μl PCR reaction mix containing 5 μl of the previously prepared prostate cDNA mix, 15 μl of 3.3× PCR buffer, 4 μl of dNTPs (2.5 mM each), 20 pmoles of primer PG15Eco3, 20 pmoles of primer PG13Bam2, 1 μl of RtthXL enzyme and 2.2 μl of Mg(OAc)₂ (Hot Start) was set up and amplification was performed for 35 cycles of 30 sec. at 94° C., 10 min. at 72° C., 4 min. at 67° C. after an initial denaturation step of 10 min. at 94° C. Size and integrity of the PCR product were assessed by migration on a 1% agarose gel. 2 μg of the amplification product were digested with 2.4 units of EcoRI (PROMEGA ref R601A) and 2.0 units of BamHI (PROMEGA ref R602A) in 50 μl of 1× Multicore buffer for 2 hours at 37° C. Enzymes were then heat inactivated for 20 min. at 68° C., DNA was phenol/chloroform extracted and ethanol-precipitated and its concentration was estimated by migration on a 1% agarose gel.

[0294] To prepare the vector, 2 μg of pEGFP-N1 vector (CLONTECH ref 6085-1) were digested with 2.4 units of EcoRI (PROMEGA ref R601A) and 2.0 units of BamHI (PROMEGA ref R602A) in 50 μl of 1× Multicore buffer for 2 hours at 37° C. Enzymes were then heat inactivated for 20 min. at 68° C., DNA was phenol/chloroform extracted and ethanol-precipitated and its concentration and integrity were estimated by migration on a 1% agarose gel. 20 ng of the BamHI and EcoRI digested pEGFP-N1 vector were added to 50 ng of BamHI-EcoRI digested PG1 cDNAs. Ligation was performed overnight at 13° C. using 0.5 units of T4 DNA ligase (BOEHRINGER ref 84333623) in a final volume of 20 μl containing 1× ligase buffer. The ligation reaction mix was desalted by dialysis against water (MILLIPORE ref VSWP01300) for 30 min. at room temperature. One fifth of the desalted ligation reaction was electroporated in 25 μl of competent cells ElectroMAX DH10B (GIBCO BRL ref 18290-015) using a resistance of 126 Ohms, capacitance of 50 μF, and voltage of 2.5 kV. Bacteria were then incubated in 500 μl of SOB medium for 30 min. at 37° C. One fifth was plated on LB agar containing 40 μg/ml kanamycin (SIGMA ref K4000) and incubated overnight at 37° C. Plasmid DNA was prepared from an overnight liquid culture of individual colonies and sequenced. Among the different forms identified 3 were used:

[0295] A) PG1 which includes all the coding exons from exon 1 to 8.

[0296] B) PG1/1-4 which corresponds to an alternative messenger which is due to an alternative splicing, joining exon 1 to exon 4, and resulting in the absence of exons 2 and 3.

[0297] C) PG1/1-5 which corresponds to an alternative messenger which is due to an alternative splicing, joining exon 1 to exon 5, and resulting in the absence of exons 2, 3 and 4.

[0298] D) Vector PG1/1-7: A cDNA insert encoding a truncated protein was synthesized by PCR amplification, using primers PG15Eco3 and PG1mut29Bam (5′ CCTGGATCCCCTCCATCGTCTTTCCCTT 3′) and vector PG1 as a template. The resulting PCR product was cloned following the same protocol as described above.
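
The cloning primers given above can be checked for the restriction sites they are stated to carry. The sketch below uses the primer sequences from the text and the standard EcoRI (GAATTC) and BamHI (GGATCC) recognition sequences; the helper names are illustrative, and the check says nothing about the reading frame of the resulting fusion, which depends on the vector sequence.

```python
RECOGNITION_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}

# Primer sequences as given in the text (5' to 3').
PRIMERS = {
    "PG15Eco3": "CCTGAATTCCGCCGAGCTGAGAAGATGC",
    "PG13Bam2": "CCTGGATCCGCTTTAATAGTAACCCACAGGCAG",
    "PG1mut29Bam": "CCTGGATCCCCTCCATCGTCTTTCCCTT",
}


def sites_in(primer):
    """Map each enzyme whose site occurs in the primer to the 0-based
    index of the first occurrence."""
    return {
        enzyme: primer.find(site)
        for enzyme, site in RECOGNITION_SITES.items()
        if site in primer
    }


for name, sequence in PRIMERS.items():
    print(name, len(sequence), sites_in(sequence))

# The forward primer also carries an ATG near its 3' end.
forward_atg_index = PRIMERS["PG15Eco3"].find("ATG")
```

Each primer carries its site after a three-base 5′ extension, which is a common design to help the enzyme cut close to the end of a PCR product.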

Transfection of the PG1 expression vectors in human prostate cell lines.

[0299] The DNA/lipofectin solution was prepared as follows: 1.5 μl of lipofectin (GIBCO BRL ref 18292-011) was diluted in 100 μl of OPTI-MEM medium (GIBCO BRL ref 31985-018), and incubated for 30 min. at room temperature before being mixed with 0.5 μg of vector diluted in 100 μl of OPTI-MEM medium and incubated for 15 min. at room temperature. Cells were inoculated in RPMI 1640 medium (Gibco BRL ref 61870-010) containing 5% fetal calf serum (Dutscher ref P30-3302) on slides (NUNC Lab-Tek ref 177402A) and grown at 37° C. in 5% CO₂. Cells reaching 40-60% confluency were rinsed with 300 μl OPTI-MEM medium and incubated with the DNA/lipofectin solution for 6 hours at 37° C. The medium containing DNA was replaced by medium supplemented with fetal calf serum and cells were incubated for at least 36 hours at 37° C. Slides were rinsed in PBS and cells were fixed in ethanol, treated with propidium iodide, and examined with a fluorescence microscope using a double-pass filter set for FITC/PI.

[0300] After transfection of PG1 and PG1/1-4 in both the normal and tumoural prostatic cell lines, green fluorescence was detected in and around the nucleus (FIGS. 10 and 11). This result shows that the PG1 protein is localised in the nucleus and/or the nuclear membrane. Furthermore, it suggests that exons 2 and 3 are dispensable for translocation of PG1 to the nucleus. In addition, no difference in the intracellular localisation of these two forms was detected between the tumoural and the normal prostatic cell lines.

[0301] On the contrary, transfection experiments using PG1/1-5 show that this form is cytoplasmic in the normal prostatic cell line PNT2A. This suggests that exon 4 might be important for the regulation of the translocation to the nucleus. Interestingly, similar transfection experiments in the tumoural cell line PC3 show that PG1/1-5 remains nuclear and/or perinuclear (FIG. 12). This result shows that there is an abnormality in the regulation of the intracellular localisation of the PG1 isoforms in this tumoural cell line. Furthermore, it indicates that the normal function of PG1 can be altered indirectly in prostatic tumors by an abnormality in the regulation of its intracellular location.

[0302] Finally, a nonsense mutation has been identified in the prostatic tumoural cell line LNCaP, in exon 6 of PG1 (SEQ ID NO: 69). This mutation is responsible for the production of a truncated protein (SEQ ID NO: 70). To determine the intracellular location of this truncated protein, PG1/1-7 and PG1 were transfected in the normal prostatic cell line PNT2A. Comparison of the fluorescence detected in both sets of experiments clearly showed that the truncated form was localised in the cytoplasm, whereas the non-truncated protein was located in and/or around the nucleus (FIG. 13). This result indicates that this mutated PG1 is translated into a truncated protein which is unable to reach the nucleus. It also suggests that exons 7 and 8 may play an important role in the regulation of the intracellular localisation of PG1. Furthermore, it supports the previous hypothesis that an altered regulation of PG1 intracellular localisation might be involved in prostate tumorigenesis.

Transfection      pEGFP-N1     PG1      PG1/1-4  PG1/1-5      PG1/1-7
PNT2 06/17/98     NA           nuclear  nuclear  ND           ND
PNT2 06/30/98     cytoplasmic  nuclear  nuclear  ND           ND
PNT2 07/16/98     cytoplasmic  NA       NA       cytoplasmic  ND
PC3 07/16/98      NA           nuclear  nuclear  nuclear      ND
PC3 07/16/98 bis  cytoplasmic  nuclear  NA       NA           ND
PC3 08/27/98      cytoplasmic  nuclear  nuclear  nuclear      NA
PNT2 08/28/98     cytoplasmic  nuclear  NA       cytoplasmic  cytoplasmic

PG1: All exons; PG1/1-4: X2-3 Spliced out; PG1/1-5: X2-3-4 Spliced out; PG1/1-7: mut aa229
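
The transfection summary that closes paragraph [0302] can be tallied per cell line and construct to confirm that the repeated experiments agree with one another. The sketch below encodes the seven experiments as read from that summary; the assignment of values to rows and columns is a reading of the flattened layout, and entries marked NA or ND are simply skipped.

```python
from collections import defaultdict

CONSTRUCTS = ["pEGFP-N1", "PG1", "PG1/1-4", "PG1/1-5", "PG1/1-7"]

# (cell line, date, localisation per construct, in the order of CONSTRUCTS)
EXPERIMENTS = [
    ("PNT2", "06/17/98", ["NA", "nuclear", "nuclear", "ND", "ND"]),
    ("PNT2", "06/30/98", ["cytoplasmic", "nuclear", "nuclear", "ND", "ND"]),
    ("PNT2", "07/16/98", ["cytoplasmic", "NA", "NA", "cytoplasmic", "ND"]),
    ("PC3", "07/16/98", ["NA", "nuclear", "nuclear", "nuclear", "ND"]),
    ("PC3", "07/16/98 bis", ["cytoplasmic", "nuclear", "NA", "NA", "ND"]),
    ("PC3", "08/27/98", ["cytoplasmic", "nuclear", "nuclear", "nuclear", "NA"]),
    ("PNT2", "08/28/98",
     ["cytoplasmic", "nuclear", "NA", "cytoplasmic", "cytoplasmic"]),
]


def summarise(experiments):
    """Collect the set of localisations observed for each
    (cell line, construct) pair, ignoring NA and ND entries."""
    observed = defaultdict(set)
    for cell_line, _date, calls in experiments:
        for construct, call in zip(CONSTRUCTS, calls):
            if call not in ("NA", "ND"):
                observed[(cell_line, construct)].add(call)
    return dict(observed)


summary = summarise(EXPERIMENTS)
for key in sorted(summary):
    print(key, sorted(summary[key]))
```

Every pair with data shows a single localisation, and PG1/1-5 is the only construct whose localisation differs between the two cell lines.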

Alternative Splice Species

[0303] Alternative splicing is a common natural tool for the inhibition of function of full length gene products. Alternative splicing is known to result in enzyme isoforms possessing different kinetic characteristics (pyruvate kinase: M1 and M2; Yamada K, Noguchi T, Biochem J. Jan. 1, 1999; 337(Pt1):1-11). The estrogen receptor (ER) gene is known to possess variant splicing yielding the deletions of exon 3, 5, or 7. The truncated ER protein produced from variant mRNA could act mainly as a repressor through dominant negative effects on normal ER protein (Iwase H, Omoto Y, Iwata H, Hara Y, Ando Y, Kobayashi S, Oncology December 1998; 55 Suppl S1:11-16). Yu et al. (Yu J J, Mu C, Dabholkar M, Guo Y, Bostick-Bruton F, Reed E, Int J Mol Med 1998 Mar; 1(3):617-620) demonstrated that there is an association between alternative splicing of ERCC1 and a reduction in the cellular capability to repair cisplatin-DNA adducts. Munoz-Sanjuan et al. (Munoz-Sanjuan I, Simandl B K, Fallon J F, Nathans J, Development Dec. 14, 1998; 126(Pt 2):409-421) demonstrated the existence of two differentially spliced isoforms of fibroblast growth factor (FGF) type two genes that are present in non-overlapping spatial distributions in the neural tube and adjacent structures in the developing chicken embryo. One of these forms is secreted and activates the expression of HoxD13, HoxD11, Fgf-4 and BMP-2 ectopically, consistent with cFHF-2 playing a role in anterior-posterior patterning of the limb.

[0304] CD44 is a cell adhesion molecule that is present as numerous isoforms created by mRNA alternative splicing. Expression of variant isoforms of CD44 is associated with tumor growth and metastasis. Shibuya et al. (Shibuya Y, Okabayashi T, Oda K, Tanaka N, Jpn J Clin Oncol October 1998; 28(10):609-14) showed that the ratio of two particular isoforms is a useful indicator of prognosis in gastric and colorectal carcinoma. Zhang et al. (Zhang Y F, Jeffrey S, Burchill S A, Berry P A, Kaski J C, Carter N D, Br J Cancer November 1998; 78(9):1141-6) showed that human endothelin receptor A is subject to alternative splicing giving at least two isoforms. The truncated receptor was expressed in all tissues and cells examined, but the level of expression varied. In melanoma cell lines and melanoma tissues, the truncated receptor gene was the major species, whereas the wild-type ETA was predominant in other tissues. Zhang et al. conclude that the function and biological significance of this truncated ETA receptor is not clear, but it may have regulatory roles for cell responses to ETs.

EXAMPLE 8 Identification of PG1 Alternative Splice Species

[0305] The PG1 cDNA was first cloned by screening of a human prostate cDNA library. Sequence analysis of about 400 cDNA clones showed that at least 14 isoforms were present in this cDNA library. Comparison of their sequences to the genomic sequence showed that these isoforms resulted from a complex set of different alternative splicing events between numerous exons (FIG. 14).

[0306] To rule out the possibility of a cloning artefact generated during the cDNA library construction and to systematically identify all existing alternative splice junctions, RT-PCR experiments were performed on RNA of normal prostate as well as normal prostatic cell lines PNT1A, PNT1B and PNT2 using all the possible combinations of primers specific to the different exon borders (SEQ ID NOs: 137-178). The presence of multiple PCR bands in each reaction was assessed by migration in an agarose gel. Each band was analysed by sequencing, and the presence or absence of specific splicing events, as seen in the sequence by a specific splice junction, was scored as plus or minus in FIG. 15.
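
The systematic design described above, testing every combination of exon-border primers and scoring each possible junction as plus or minus, can be sketched as follows. The exon labels are a simplified, hypothetical set (the actual exon set is larger and includes additional exons such as 3b), and the observed junctions are invented for illustration rather than taken from FIG. 15.

```python
# Simplified, hypothetical exon labels in 5' to 3' genomic order.
EXON_ORDER = ["1", "2", "3", "4", "5", "6", "7", "8"]


def primer_pairs(exons):
    """Every (forward exon, reverse exon) combination in which the forward
    primer lies in an exon upstream of the reverse primer."""
    return [
        (forward, reverse)
        for i, forward in enumerate(exons)
        for reverse in exons[i + 1:]
    ]


def score_junctions(candidates, observed):
    """Score each candidate junction '+' if it was seen in a sequenced
    RT-PCR band and '-' otherwise."""
    return {
        junction: "+" if junction in observed else "-"
        for junction in candidates
    }


pairs = primer_pairs(EXON_ORDER)
# Invented example of junctions confirmed by sequencing in one sample.
observed_in_sample = {("1", "2"), ("1", "4"), ("1", "5")}
scores = score_junctions(pairs, observed_in_sample)
print(len(pairs), scores[("1", "4")], scores[("3", "7")])
```

With n exons the design yields n(n-1)/2 candidate junctions, which is why the number of primer combinations grows quickly as exons are added.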

[0307] Furthermore, to identify aberrant splicing events in prostate tumors, similar experiments were performed on RNA extracted from tumoural prostatic cell lines LNCaP (obtained from two different sources and named FCG and JMB), CaHPV, Du145 and PC3 as well as on RNA obtained from prostate tumors (ECP5 to ECP24).

[0308] As shown in the first five columns, all isoforms identified in the cDNA library were detected in RNA of normal prostate, normal prostatic cell lines or prostate tumors. In addition to the different splice junctions detected in the cDNA library, 19 other splice junctions were detected in normal prostate or in normal prostatic cell lines. Two types of exon junctions (exons 3-7, exons 3b-8) were never detected in normal prostate, normal prostatic cell lines, prostate tumors or prostatic tumoural cell lines. Comparison between normal and tumoural samples showed the presence of 2 additional exon junctions (exons 3-8, exons 5-8) in the tumoural samples that were not detected in the normal samples. This result demonstrates that during tumorigenesis, the complex regulation of PG1 splicing has been altered, resulting in an abnormal ratio of the different isoforms. This is of specific interest since it has been shown in patients with a genetic predisposition to Wilms tumor that an imbalance between different RNA isoforms might be involved in tumorigenesis (Bickmore et al., Science 1992, 257:325-7; Little et al., Hum Mol Genet 1995, 4:351-8).

[0309] Interestingly, comparison between normal and tumoural samples also showed that some exon junctions are present in all normal samples, but are absent in numerous tumoural samples. This further indicates that the normal function of PG1 can be altered by an abnormality in the regulation of PG1 splicing and further supports the previous hypothesis.
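
Both comparisons described above, junctions gained in tumoural samples and junctions lost from them, are set differences between the junctions scored in each group. In the sketch below the tumour-only junctions (exons 3-8 and 5-8) are those named in the text, while the remaining junctions are invented placeholders rather than data from FIG. 15.

```python
# Junctions scored '+' in each group. Only the 3-8 and 5-8 junctions are
# taken from the text; the others are hypothetical placeholders.
normal_junctions = {("1", "2"), ("2", "3"), ("1", "4"), ("1", "5")}
tumoural_junctions = {("1", "2"), ("1", "4"), ("3", "8"), ("5", "8")}

# Junctions detected only in tumoural samples.
gained_in_tumour = tumoural_junctions - normal_junctions

# Junctions detected in normal samples but missing from tumoural samples.
lost_in_tumour = normal_junctions - tumoural_junctions

# Junctions common to both groups.
shared = normal_junctions & tumoural_junctions

print(sorted(gained_in_tumour), sorted(lost_in_tumour), sorted(shared))
```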

[0310] Furthermore, comparison between the different types of normal samples (Col. 2 versus Cols. 3, 4 and 5) also showed differences in the presence or absence of some exon junctions. This indicates that the transformation process necessary for the generation of these normal prostatic cell lines might result in similar alterations, which further supports the previous hypothesis.

EXAMPLE 9 Determining the Tumor Suppressor Activity of the PG1 Gene Product, Mutants and Other PG1 Polypeptides

[0311] PG1 variants which result either from alternate splicing of the PG1 mRNA or from a mutation of PG1 that introduces a stop codon (nucleotide of SEQ ID NO: 69 and protein of SEQ ID NO: 70) can no longer perform their role of tumor suppressor. It is possible and even likely that the PG1 tumor suppressor role extends beyond prostate cancer to other forms of malignancy. PG1 therefore represents a prime candidate for gene therapy of cancer by creating a targeting vector which knocks out the mutant and/or introduces a wild-type PG1 gene (e.g. SEQ ID NO: 3 or 179) or a fragment thereof.

[0312] To validate this model, PG1 and its alternatively spliced or mutated variants are stably transfected into tumor cell lines using methods described in Section VIII. The efficiency of transfection is determined by northern and western blotting; the latter is performed using antibodies prepared against PG1 synthetic peptides designed to distinguish the product of the most abundant PG1 mRNA from the alternatively spliced variants, the truncated variant, or other functional mutants. The production of synthetic peptides and of polyclonal antibodies is performed using the methods described herein in Sections III and VII. After demonstrating that PG1 and its variants are efficiently expressed in various tumor cell lines, preferably derived from human prostate cancer, hepatocarcinoma, lung and colon carcinoma, the effect of this gene on the rate of cell division, DNA synthesis, ability to grow in soft agar and ability to induce tumor progression and metastasis when injected in immunologically deficient nude mice is determined.

[0313] Alternatively, the PG1 gene and its variants are inserted into adenoviruses that are used to obtain a high level of expression of these genes. This method is preferred for testing the effect of PG1 expression in animals that spontaneously develop tumors. The production of specific adenoviruses is obtained using methods familiar to those of ordinary skill in cell and molecular biology.

[0314] II. POLYNUCLEOTIDES

[0315] The present invention encompasses polynucleotides in the form of PG1 genomic DNA or cDNA as well as polynucleotides for use as primers and probes in the methods of the invention. These polynucleotides may consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence from any sequence in the Sequence Listing as well as sequences which are complementary thereto (“complements thereof”). Preferably said sequence is selected from SEQ ID NOs: 3, 112-125, 179, 182-184. The “contiguous span” is at least 6, 8, 10, 12, 15, 20, 25, 30, 50, 100, 200, or 500 nucleotides in length. It should be noted that the polynucleotides of the present invention are not limited to having the exact flanking sequences surrounding the polymorphic bases which are enumerated in the Sequence Listing. Rather, it will be appreciated that the flanking sequences surrounding the biallelic markers, or any of the primers or probes of the invention which are more distant from a biallelic marker, may be lengthened or shortened to any extent compatible with their intended use, and the present invention specifically contemplates such sequences. It will be appreciated that the polynucleotides referred to in the Sequence Listing may be of any length compatible with their intended use. Also, the flanking regions outside of the contiguous span need not be homologous to native flanking sequences which actually occur in humans. The addition of any nucleotide sequence which is compatible with the polynucleotide's intended use is specifically contemplated. The contiguous span may optionally include the PG1-related biallelic marker in said sequence. Optionally, either allele of the biallelic markers described above in the definition of PG1-related biallelic marker is specified as being present at the PG1-related biallelic marker.

[0316] The invention also relates to polynucleotides that hybridize, under conditions of high or intermediate stringency, to a polynucleotide of a sequence from any sequence in the Sequence Listing as well as sequences which are complementary thereto. Preferably said sequence is selected from SEQ ID NOs: 3, 112-125, 179, 182-184. Preferably such polynucleotides are at least 6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 200, or 500 nucleotides in length. Preferred polynucleotides comprise a PG1-related biallelic marker. Optionally, either allele of the biallelic markers described above in the definition of PG1-related biallelic marker is specified as being present at the biallelic marker site. Conditions of high and intermediate stringency are further described in Section X.C.4, below.

[0317] The invention embodies polynucleotides which encode an entire human, mouse or mammalian PG1 protein, or fragments thereof. Generally the polynucleotides of the invention comprise the naturally occurring nucleotide sequence of PG1. However, any naturally occurring silent codon variation or other silent codon variation can be employed to encode the PG1 amino acid sequence. As for those amino acids which are changed or added to the PG1 gene for any embodiment of the invention which requires the expression of a nucleotide sequence, the nucleic acid sequences generally will be chosen to optimize expression in the specific human or non-human animal system in which the polynucleotide is intended to be used, making use of known codon preferences. The PG1 polynucleotides of the invention can be the native nucleotide sequence which encodes a human, mouse, or mammalian PG1 protein, preferably the PG1 polynucleotide sequence of SEQ ID NOs: 3, 112-125, 179, 182-184, and the complements thereof. The polynucleotides of the invention include those which encode PG1 polypeptides with a contiguous stretch of at least 8, 10, 12, 15, 20, 25, 30, 50, 100 or 200 amino acids from SEQ ID NOs: 4, 5, 70, 74, and 125-136, as well as any other human, mouse or mammalian PG1 polypeptide. In addition, the present invention encompasses polynucleotides which comprise a contiguous stretch of at least 8, 10, 12, 15, 20, 25, 30, 50, 100, 200, or 500 nucleotides of a human, mouse or mammalian PG1 genomic sequence as well as complete human, mouse, or mammalian PG1 genes, preferably of SEQ ID NOs: 179, 182, 183, and the complements thereof.

[0318] The present invention encompasses polynucleotides which consist of, consist essentially of, or comprise a contiguous stretch of at least 8, 10, 12, 15, 20, 25, 30, 50, 100, 200, or 500 nucleotides of a human, mouse or mammalian PG1 cDNA sequence as well as an entire human, mouse, or mammalian PG1 cDNA. The cDNA species and polynucleotide fragments comprised by the polynucleotides of the invention include the predominant species derived from any human, mouse or mammal source, preferably SEQ ID NOs: 3, 184, and the complements thereof. In addition, the polynucleotides of the invention comprise cDNA species, and fragments thereof, that result from the alternative splicing of PG1 transcripts in any human, mouse or other mammal, preferably the cDNA species of SEQ ID NOs: 112-124, and complements thereof. Moreover, the invention encompasses cDNA species and other polynucleotides which consist of or comprise the polynucleotides which span a splice junction, preferably including any one of SEQ ID NOs: 137 to 178, and the complements thereof, more preferably any one of SEQ ID NOs: 137 to 149, 151 to 169, 171 to 178, and the complements thereof. The polynucleotides of the invention also include cDNA and other polynucleotides which comprise two covalently linked PG1 exons, derived from a single human, mouse or mammalian species, immediately adjacent to one another in the order shown, and selected from the following pairs of PG1 exons: 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 2:3, 2:4, 2:5, 2:6, 2:7, 2:8, 3:4, 3:5, 3:6, 3:7, 3:8, 4:5, 4:6, 4:7, 4:8, 5:6, 5:7, 5:8, 6:7, 6:8, 7:8, 1:1bis, 1bis:2, 1bis:3, 1bis:4, 1bis:5, 1bis:6, 1bis:7, 1bis:8, 3:3bis, 3bis:4, 3bis:5, 3bis:6, 3bis:7, 3bis:8, 5:5, 5bis:8, 1:6bis, 2:6bis, 3:6bis, 4:6bis, 5:6bis, 6bis:7, 6bis:8, and the complements thereof. In a preferred embodiment the sequences of the PG1 exons in each of the pairs of exons are selected as follows:

[0319] exon 1-SEQ ID NO: 100; exon 2-SEQ ID NO: 101; exon 3-SEQ ID NO: 102;

[0320] exon 4-SEQ ID NO: 103; exon 5-SEQ ID NO: 104; exon 6-SEQ ID NO: 105;

[0321] exon 7-SEQ ID NO: 106; exon 8-SEQ ID NO: 107; exon 1bis-SEQ ID NO: 108;

[0322] exon 3bis-SEQ ID NO: 109; exon 5bis-SEQ ID NO: 110; and exon 6bis-SEQ ID NO: 111. Because of the 8 different polyadenylation sites in exon 8, any cDNA or polynucleotide of the invention comprising a human cDNA fragment encompassing exon 8 may be truncated such that only the first 330 nucleotides, 699 nucleotides, 833 nucleotides, 1826 nucleotides, 2485 nucleotides, 2805 nucleotides, 4269 nucleotides or 4315 nucleotides of exon 8 shown in SEQ ID NO: 107 are present.
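As an illustration only, the truncation of exon 8 at its alternative polyadenylation sites can be sketched in Python. The eight lengths are those listed in the preceding paragraph; the function names and the placeholder sequence are hypothetical, and a real exon 8 sequence would be taken from SEQ ID NO: 107.

```python
from typing import List

# Number of exon 8 nucleotides retained for each of the 8 polyadenylation
# sites, as listed in the text.
EXON8_POLYA_LENGTHS = (330, 699, 833, 1826, 2485, 2805, 4269, 4315)


def truncate_exon8(exon8_seq: str, site: int) -> str:
    """Return the first N nucleotides of exon 8 for polyadenylation site 1..8."""
    if not 1 <= site <= len(EXON8_POLYA_LENGTHS):
        raise ValueError("site must be between 1 and 8")
    length = EXON8_POLYA_LENGTHS[site - 1]
    if len(exon8_seq) < length:
        raise ValueError("exon 8 sequence is shorter than the requested truncation")
    return exon8_seq[:length]


def all_exon8_truncations(exon8_seq: str) -> List[str]:
    """Return the eight possible truncated forms of exon 8, shortest first."""
    return [truncate_exon8(exon8_seq, site)
            for site in range(1, len(EXON8_POLYA_LENGTHS) + 1)]


if __name__ == "__main__":
    # Hypothetical placeholder, NOT the real exon 8 sequence.
    placeholder = "ACGT" * 1079
    print([len(t) for t in all_exon8_truncations(placeholder)])
```

Each truncated form is a prefix of the next longer one, so a cDNA fragment containing exon 8 can be matched against the eight prefixes in order of length.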

[0323] The primers of the present invention may be designed from the disclosed sequences for any method known in the art. A preferred set of primers is fashioned such that the 3′ end of the contiguous span of identity with the sequences of the Sequence Listing is present at the 3′ end of the primer. Such a configuration allows the 3′ end of the primer to hybridize to a selected nucleic acid sequence and dramatically increases the efficiency of the primer for amplification or sequencing reactions. Allele specific primers may be designed such that a biallelic marker is at the 3′ end of the contiguous span and the contiguous span is present at the 3′ end of the primer. Such allele specific primers tend to selectively prime an amplification or sequencing reaction so long as they are used with a nucleic acid sample that contains one of the two alleles present at a biallelic marker. The 3′ end of a primer of the invention may be located within or at least 2, 4, 6, 8, 10, 12, 15, 18, 20, 25, 50, 100, 250, 500, or 1000 nucleotides upstream of a PG1-related biallelic marker in said sequence or at any other location which is appropriate for its intended use in sequencing, amplification or the location of novel sequences or markers.
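The allele specific primer layout described above, with the biallelic marker as the 3′ terminal base, can be sketched as follows. This is a minimal illustration under stated assumptions: positions are 1-based on the strand being read, the primer length of 20 is an arbitrary default, and the function names and template are hypothetical rather than taken from the Sequence Listing.

```python
from typing import Dict, Iterable


def allele_specific_primer(template: str, marker_pos: int, allele: str,
                           length: int = 20) -> str:
    """Build a primer of `length` nt whose 3' terminal base is `allele`.

    `marker_pos` is the 1-based position of the biallelic marker in `template`.
    The bases upstream of the marker are copied from the template and the
    chosen allele is placed at the 3' end.
    """
    if allele not in ("A", "C", "G", "T"):
        raise ValueError("allele must be one of A, C, G, T")
    if length < 2 or marker_pos < length or marker_pos > len(template):
        raise ValueError("marker position and length do not fit the template")
    upstream = template[marker_pos - length:marker_pos - 1]
    return upstream + allele


def primers_for_both_alleles(template: str, marker_pos: int,
                             alleles: Iterable[str],
                             length: int = 20) -> Dict[str, str]:
    """Return one allele specific primer per allele of the marker."""
    return {a: allele_specific_primer(template, marker_pos, a, length)
            for a in alleles}


if __name__ == "__main__":
    # Hypothetical template with a G/A marker at position 11.
    print(primers_for_both_alleles("AAAAACCCCCGTTTTT", 11, ("G", "A"), length=6))
```

The two primers differ only at their last base, which is what lets each one preferentially prime on the template carrying its own allele.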

[0324] Preferred amplification primers include the polynucleotides disclosed in SEQ ID NOs: 39-56, and 63-68. Additional preferred amplification primers for particular non-genic PG1-related biallelic markers are listed as follows by the internal reference number for the marker and the SEQ ID NOs for the PU and RP amplification primers respectively:

[0325] 4-14-107 use SEQ ID NOs 339 and 382; 4-14-317 use SEQ ID NOs 339 and 382;

[0326] 4-14-35 use SEQ ID NOs 339 and 382; 4-20-149 use SEQ ID NOs 340 and 383;

[0327] 4-22-174 use SEQ ID NOs 341 and 384; 4-22-176 use SEQ ID NOs 341 and 384;

[0328] 4-26-60 use SEQ ID NOs 342 and 385; 4-26-72 use SEQ ID NOs 342 and 385;

[0329] 4-3-130 use SEQ ID NOs 343 and 386; 4-38-63 use SEQ ID NOs 344 and 387;

[0330] 4-38-83 use SEQ ID NOs 344 and 387; 4-4-152 use SEQ ID NOs 345 and 388;

[0331] 4-4-187 use SEQ ID NOs 345 and 388; 4-4-288 use SEQ ID NOs 345 and 388;

[0332] 4-42-304 use SEQ ID NOs 346 and 389; 4-42-401 use SEQ ID NOs 346 and 389;

[0333] 4-43-328 use SEQ ID NOs 347 and 390; 4-43-70 use SEQ ID NOs 347 and 390;

[0334] 4-50-209 use SEQ ID NOs 348 and 391; 4-50-293 use SEQ ID NOs 348 and 391;

[0335] 4-50-323 use SEQ ID NOs 348 and 391; 4-50-329 use SEQ ID NOs 348 and 391;

[0336] 4-50-330 use SEQ ID NOs 348 and 391; 4-52-163 use SEQ ID NOs 349 and 392;

[0337] 4-52-88 use SEQ ID NOs 349 and 392; 4-53-258 use SEQ ID NOs 350 and 393;

[0338] 4-54-283 use SEQ ID NOs 351 and 394; 4-54-388 use SEQ ID NOs 351 and 394;

[0339] 4-55-70 use SEQ ID NOs 352 and 395; 4-55-95 use SEQ ID NOs 352 and 395;

[0340] 4-56-159 use SEQ ID NOs 353 and 396; 4-56-213 use SEQ ID NOs 353 and 396;

[0341] 4-58-289 use SEQ ID NOs 354 and 397; 4-58-318 use SEQ ID NOs 354 and 397;

[0342] 4-60-266 use SEQ ID NOs 355 and 398; 4-60-293 use SEQ ID NOs 355 and 398;

[0343] 4-84-241 use SEQ ID NOs 356 and 399; 4-84-262 use SEQ ID NOs 356 and 399;

[0344] 4-86-206 use SEQ ID NOs 357 and 400; 4-86-309 use SEQ ID NOs 357 and 400;

[0345] 4-88-349 use SEQ ID NOs 358 and 401; 4-89-87 use SEQ ID NOs 359 and 402;

[0346] 99-123-184 use SEQ ID NOs 360 and 403; 99-128-202 use SEQ ID NOs 361 and 404;

[0347] 99-128-275 use SEQ ID NOs 361 and 404; 99-128-313 use SEQ ID NOs 361 and 404;

[0348] 99-128-60 use SEQ ID NOs 361 and 404; 99-12907-295 use SEQ ID NOs 362 and 405;

[0349] 99-130-58 use SEQ ID NOs 363 and 406; 99-134-362 use SEQ ID NOs 364 and 407;

[0350] 99-140-130 use SEQ ID NOs 365 and 408; 99-1462-238 use SEQ ID NOs 366 and 409;

[0351] 99-147-181 use SEQ ID NOs 367 and 410; 99-1474-156 use SEQ ID NOs 368 and 411;

[0352] 99-1474-359 use SEQ ID NOs 368 and 411;

[0353] 99-1479-158 use SEQ ID NOs 369 and 412;

[0354] 99-1479-379 use SEQ ID NOs 369 and 412; 99-148-129 use SEQ ID NOs 370 and 413;

[0355] 99-148-132 use SEQ ID NOs 370 and 413; 99-148-139 use SEQ ID NOs 370 and 413;

[0356] 99-148-140 use SEQ ID NOs 370 and 413; 99-148-182 use SEQ ID NOs 370 and 413;

[0357] 99-148-366 use SEQ ID NOs 370 and 413; 99-148-76 use SEQ ID NOs 370 and 413;

[0358] 99-1480-290 use SEQ ID NOs 371 and 414;

[0359] 99-1481-285 use SEQ ID NOs 372 and 415;

[0360] 99-1484-101 use SEQ ID NOs 373 and 416;

[0361] 99-1484-328 use SEQ ID NOs 373 and 416;

[0362] 99-1485-251 use SEQ ID NOs 374 and 417;

[0363] 99-1490-381 use SEQ ID NOs 375 and 418;

[0364] 99-1493-280 use SEQ ID NOs 376 and 419; 99-151-94 use SEQ ID NOs 377 and 420;

[0365] 99-211-291 use SEQ ID NOs 378 and 421; 99-213-37 use SEQ ID NOs 379 and 422;

[0366] 99-221-442 use SEQ ID NOs 380 and 423; 99-222-109 use SEQ ID NOs 381 and 424; and the complements thereof.

[0367] Primers with their 3′ ends located 1 nucleotide upstream or downstream of a PG1-related biallelic marker have a special utility in microsequencing assays. Preferred microsequencing primers include the polynucleotides from position 1 to position 23 and from position 25 to position 47 of SEQ ID NOs: 21-38, as well as the complements thereof. Additional preferred microsequencing primers for particular non-genic PG1-related biallelic markers are listed as follows by the internal reference number for the marker and the SEQ ID NOs of the two preferred microsequencing primers:

[0368] 4-14-107 of SEQ ID NOs 425 and 502*; 4-14-317 of SEQ ID NOs 426 and 503*;

[0369] 4-14-35 of SEQ ID NOs 427 and 504*; 4-20-149 of SEQ ID NOs 428* and 505;

[0370] 4-20-77 of SEQ ID NOs 429 and 506; 4-22-174 of SEQ ID NOs 430* and 507;

[0371] 4-22-176 of SEQ ID NOs 431 and 508; 4-26-60 of SEQ ID NOs 432 and 509*;

[0372] 4-26-72 of SEQ ID NOs 433 and 510; 4-3-130 of SEQ ID NOs 434 and 511*;

[0373] 4-38-63 of SEQ ID NOs 435 and 512; 4-38-83 of SEQ ID NOs 436 and 513*;

[0374] 4-4-152 of SEQ ID NOs 437 and 514; 4-4-187 of SEQ ID NOs 438* and 515;

[0375] 4-4-288 of SEQ ID NOs 439 and 516; 4-42-304 of SEQ ID NOs 440 and 517;

[0376] 4-42-401 of SEQ ID NOs 441* and 518; 4-43-328 of SEQ ID NOs 442 and 519;

[0377] 4-43-70 of SEQ ID NOs 443* and 520; 4-50-209 of SEQ ID NOs 444* and 521;

[0378] 4-50-293 of SEQ ID NOs 445* and 522; 4-50-323 of SEQ ID NOs 446* and 523;

[0379] 4-50-329 of SEQ ID NOs 447* and 524; 4-50-330 of SEQ ID NOs 448 and 525;

[0380] 4-52-163 of SEQ ID NOs 449* and 526; 4-52-88 of SEQ ID NOs 450* and 527;

[0381] 4-53-258 of SEQ ID NOs 451 and 528*; 4-54-283 of SEQ ID NOs 452* and 529;

[0382] 4-54-388 of SEQ ID NOs 453 and 530; 4-55-70 of SEQ ID NOs 454 and 531;

[0383] 4-55-95 of SEQ ID NOs 455* and 532; 4-56-159 of SEQ ID NOs 456* and 533;

[0384] 4-56-213 of SEQ ID NOs 457 and 534; 4-58-289 of SEQ ID NOs 458* and 535;

[0385] 4-58-318 of SEQ ID NOs 459* and 536; 4-60-266 of SEQ ID NOs 460* and 537;

[0386] 4-60-293 of SEQ ID NOs 461* and 538; 4-84-241 of SEQ ID NOs 462 and 539*;

[0387] 4-84-262 of SEQ ID NOs 463 and 540; 4-86-206 of SEQ ID NOs 464 and 541*;

[0388] 4-86-309 of SEQ ID NOs 465 and 542; 4-88-349 of SEQ ID NOs 466 and 543;

[0389] 4-89-87 of SEQ ID NOs 467* and 544; 99-123-184 of SEQ ID NOs 468 and 545;

[0390] 99-128-202 of SEQ ID NOs 469 and 546; 99-128-275 of SEQ ID NOs 470 and 547;

[0391] 99-128-313 of SEQ ID NOs 471 and 548; 99-128-60 of SEQ ID NOs 472* and 549;

[0392] 99-12907-295 of SEQ ID NOs 473 and 550*;

[0393] 99-1320-58 of SEQ ID NOs 474* and 551*;

[0394] 99-134-362 of SEQ ID NOs 475 and 552*; 99-140-130 of SEQ ID NOs 476* and 553*;

[0395] 99-1462-238 of SEQ ID NOs 477* and 554; 99-147-181 of SEQ ID NOs 478 and 555*;

[0396] 99-1474-156 of SEQ ID NOs 479 and 556*; 99-1474-359 of SEQ ID NOs 480 and 557;

[0397] 99-1479-158 of SEQ ID NOs 481* and 558; 99-1479-379 of SEQ ID NOs 482 and 559;

[0398] 99-148-129 of SEQ ID NOs 483 and 560; 99-148-132 of SEQ ID NOs 484 and 561;

[0399] 99-148-139 of SEQ ID NOs 485 and 562; 99-148-140 of SEQ ID NOs 486 and 563;

[0400] 99-148-182 of SEQ ID NOs 487 and 564*; 99-148-366 of SEQ ID NOs 488 and 565;

[0401] 99-148-76 of SEQ ID NOs 489 and 566; 99-1480-290 of SEQ ID NOs 490 and 567*;

[0402] 99-1481-285 of SEQ ID NOs 491 and 568*; 99-1484-101 of SEQ ID NOs 492 and 569;

[0403] 99-1484-328 of SEQ ID NOs 493* and 570;

[0404] 99-1485-251 of SEQ ID NOs 494 and 571*;

[0405] 99-1490-381 of SEQ ID NOs 495* and 572;

[0406] 99-1493-280 of SEQ ID NOs 496 and 573*;

[0407] 99-151-94 of SEQ ID NOs 497 and 574*; 99-211-291 of SEQ ID NOs 498* and 575;

[0408] 99-213-37 of SEQ ID NOs 499 and 576; 99-221-442 of SEQ ID NOs 500 and 577;

[0409] 99-222-109 of SEQ ID NOs 501* and 578; and complements thereof.

[0410] Additional preferred microsequencing primers for particular genic PG1-related biallelic markers include a polynucleotide selected from the group consisting of the nucleotide sequences from position N−X to position N−1 of SEQ ID NO: 179, nucleotide sequences from position N+1 to position N+X of SEQ ID NO: 179, and the complements thereof, wherein X is equal to 15, 18, 20, 25, 30, or a range of 15 to 30, and N is equal to one of the following values: 2159; 2443; 4452; 5733; 8438; 11843; 1983; 12080; 12221; 12947; 13147; 13194; 13310; 13342; 13367; 13594; 13680; 13902; 16231; 16388; 17608; 18034; 18290; 18786; 22835; 22872; 25183; 25192; 25614; 26911; 32703; 34491; 34756; 34934; 5160; 39897; 40598; 40816; 40947; 45783; 47929; 48206; 48207; 49282; 50037; 50054; 50101; 50220; 50440; 50562; 50653; 50660; 50745; 50885; 51249; 51333; 51435; 51468; 51515; 51557; 51566; 51632; 51666; 52016; 52096; 52151; 52282; 52348; 52410; 52580; 52712; 52772; 52860; 53092; 53272; 53389; 53511; 53600; 53665; 53815; 54365; and 54541.
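The position arithmetic for these genic microsequencing primers can be sketched in Python. This is an illustration under stated assumptions: positions are 1-based as in the Sequence Listing, the function and key names are hypothetical, and returning the reverse complement of the downstream span (so that it could prime toward the marker from the opposite strand) is one reading of "the complements thereof" rather than something the text spells out.

```python
from typing import Dict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def microsequencing_spans(genomic: str, n: int, x: int = 20) -> Dict[str, str]:
    """Extract the spans flanking a marker at 1-based position `n`.

    upstream   = positions N-X .. N-1 (3' end just before the marker)
    downstream = positions N+1 .. N+X
    """
    if x < 1 or n - x < 1 or n + x > len(genomic):
        raise ValueError("the requested spans do not fit inside the sequence")
    upstream = genomic[n - x - 1:n - 1]
    downstream = genomic[n:n + x]
    return {
        "upstream": upstream,
        "downstream": downstream,
        "downstream_revcomp": reverse_complement(downstream),
    }


if __name__ == "__main__":
    # Hypothetical 21-nt sequence with the marker (G) at position 11.
    print(microsequencing_spans("AAAAACCCCCGTTTTTGGGGG", n=11, x=5))
```

With a real sequence, `genomic` would hold SEQ ID NO: 179, `n` would be one of the listed N values, and `x` one of 15, 18, 20, 25 or 30.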

[0411] The probes of the present invention may be designed from the disclosed sequences for any method known in the art, particularly methods which allow for testing if a particular sequence or marker disclosed herein is present. A preferred set of probes is designed for use in the hybridization assays of the invention in any manner known in the art such that they selectively bind to one allele of a biallelic marker, but not the other, under any particular set of assay conditions. Preferred hybridization probes may consist of, consist essentially of, or comprise a contiguous span which ranges in length from 8, 10, 12, 15, 18 or 20 to 25, 35, 40, 50, 60, 70, or 80 nucleotides, or be specified as being 12, 15, 18, 20, 25, 35, 40, or 50 nucleotides in length and including a PG1-related biallelic marker of said sequence. Optionally, either of the two alleles specified in the definition of PG1-related biallelic marker is specified as being present at the biallelic marker site. Optionally, said biallelic marker is within 6, 5, 4, 3, 2, or 1 nucleotides of the center of the hybridization probe or at the center of said probe. A preferred set of hybridization probes is disclosed in SEQ ID NOs: 21-38, 57-62, 185-338, and the complements thereof. Another particularly preferred set of hybridization probes includes the polynucleotides from position X to position Y of any one of SEQ ID NOs: 21-38, 57-62, 185-338, or the complements thereof, wherein X is equal to 5, 8, 10, 12, 14, 16, 18 or a range of 5 to 18, and Y is equal to 30, 32, 34, 36, 38, 40, 43 or a range of 30 to 43; preferably X equals 12 and Y equals 36. Additional preferred hybridization probes for particular genic PG1-related biallelic markers include a polynucleotide selected from the group consisting of the nucleotide sequences from position N−X to position N+Y of SEQ ID NO: 179, and the complements thereof, wherein X is equal to 8, 10, 12, 15, 20, 25, or a range of 8 to 30, Y is equal to 8, 10, 12, 15, 20, 25, or a range of 8 to 30, and N is equal to one of the following values: 2159; 2443; 4452; 5733; 8438; 11843; 1983; 12080; 12221; 12947; 13147; 13194; 13310; 13342; 13367; 13594; 13680; 13902; 16231; 16388; 17608; 18034; 18290; 18786; 22835; 22872; 25183; 25192; 25614; 26911; 32703; 34491; 34756; 34934; 5160; 39897; 40598; 40816; 40947; 45783; 47929; 48206; 48207; 49282; 50037; 50054; 50101; 50220; 50440; 50562; 50653; 50660; 50745; 50885; 51249; 51333; 51435; 51468; 51515; 51557; 51566; 51632; 51666; 52016; 52096; 52151; 52282; 52348; 52410; 52580; 52712; 52772; 52860; 53092; 53272; 53389; 53511; 53600; 53665; 53815; 54365; and 54541; wherein the nucleotide at position N is selected from one of the two alleles specified in the definition of PG1-related biallelic marker at the biallelic marker site at position N.
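The two probe definitions above reduce to simple slicing, sketched here in Python. The function names and test sequences are hypothetical. The sketch also assumes that the marker-containing fragments are 47 nucleotides long with the marker at position 24, which is inferred from the microsequencing primer positions (1 to 23 and 25 to 47) rather than stated in this paragraph.

```python
def probe_from_fragment(fragment: str, x: int = 12, y: int = 36) -> str:
    """Return positions X..Y (1-based, inclusive) of a 47-nt marker fragment.

    With the preferred X=12 and Y=36 this gives a 25-mer; if the marker sits
    at position 24 of the fragment it falls at the center of the probe.
    """
    if len(fragment) != 47:
        raise ValueError("expected a 47-nucleotide fragment")
    if not 1 <= x <= y <= 47:
        raise ValueError("X and Y must satisfy 1 <= X <= Y <= 47")
    return fragment[x - 1:y]


def genic_probe(genomic: str, n: int, x: int, y: int, allele: str) -> str:
    """Return positions N-X..N+Y of `genomic` with `allele` placed at N."""
    if allele not in ("A", "C", "G", "T"):
        raise ValueError("allele must be one of A, C, G, T")
    if x < 0 or y < 0 or n - x < 1 or n + y > len(genomic):
        raise ValueError("the requested span does not fit inside the sequence")
    return genomic[n - x - 1:n - 1] + allele + genomic[n:n + y]


if __name__ == "__main__":
    fragment = "A" * 23 + "G" + "T" * 23   # hypothetical 47-mer, marker = G
    print(probe_from_fragment(fragment))    # 25-mer centered on the marker
    print(genic_probe("AAAAACCCCCGTTTTTGGGGG", n=11, x=3, y=3, allele="A"))
```

Calling `genic_probe` once per allele yields the pair of probes that differ only at position N, which is the property a hybridization assay relies on to discriminate the two alleles.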

[0412] Any of the polynucleotides of the present invention can be labeled, if desired, by incorporating a label detectable by spectroscopic, photochemical, biochemical, immunochemical, or chemical means. For example, useful labels include radioactive substances, fluorescent dyes or biotin. Preferably, polynucleotides are labeled at their 3′ and 5′ ends. A label can also be used to capture the primer, so as to facilitate the immobilization of either the primer or a primer extension product, such as amplified DNA, on a solid support. A capture label is attached to the primers or probes and can be a specific binding member which forms a binding pair with the solid phase reagent's specific binding member (e.g. biotin and streptavidin). Therefore, depending upon the type of label carried by a polynucleotide or a probe, it may be employed to capture or to detect the target DNA. Further, it will be understood that the polynucleotides, primers or probes provided herein may themselves serve as the capture label. For example, in the case where a solid phase reagent's binding member is a nucleic acid sequence, it may be selected such that it binds a complementary portion of a primer or probe to thereby immobilize the primer or probe to the solid phase. In cases where a polynucleotide probe itself serves as the binding member, those skilled in the art will recognize that the probe will contain a sequence or “tail” that is not complementary to the target. In the case where a polynucleotide primer itself serves as the capture label, at least a portion of the primer will be free to hybridize with a nucleic acid on a solid phase. DNA labeling techniques are well known to the skilled technician.

[0413] Any of the polynucleotides, primers and probes of the present invention can be conveniently immobilized on a solid support. Solid supports are known to those skilled in the art and include the walls of wells of a reaction tray, test tubes, polystyrene beads, magnetic beads, nitrocellulose strips, membranes, microparticles such as latex particles, sheep (or other animal) red blood cells, duracytes® and others. The solid support is not critical and can be selected by one skilled in the art. Thus, latex particles, microparticles, magnetic or non-magnetic beads, membranes, plastic tubes, walls of microtiter wells, glass or silicon chips, sheep (or other suitable animal's) red blood cells and duracytes are all suitable examples. Suitable methods for immobilizing nucleic acids on solid phases include ionic, hydrophobic, covalent interactions and the like. A solid support, as used herein, refers to any material which is insoluble, or can be made insoluble by a subsequent reaction. The solid support can be chosen for its intrinsic ability to attract and immobilize the capture reagent. Alternatively, the solid phase can retain an additional receptor which has the ability to attract and immobilize the capture reagent. The additional receptor can include a charged substance that is oppositely charged with respect to the capture reagent itself or to a charged substance conjugated to the capture reagent. As yet another alternative, the receptor molecule can be any specific binding member which is immobilized upon (attached to) the solid support and which has the ability to immobilize the capture reagent through a specific binding reaction. The receptor molecule enables the indirect binding of the capture reagent to a solid support material before the performance of the assay or during the performance of the assay. The solid phase thus can be a plastic, derivatized plastic, magnetic or non-magnetic metal, glass or silicon surface of a test tube, microtiter well, sheet, bead, microparticle, chip, sheep (or other suitable animal's) red blood cells, duracytes® and other configurations known to those of ordinary skill in the art. The polynucleotides of the invention can be attached to or immobilized on a solid support individually or in groups of at least 2, 5, 8, 10, 12, 15, 20, or 25 distinct polynucleotides of the invention on a single solid support. In addition, polynucleotides other than those of the invention may be attached to the same solid support as one or more polynucleotides of the invention.

[0414] Any polynucleotide provided herein may be attached in overlapping areas or at random locations on the solid support. Alternatively, the polynucleotides of the invention may be attached in an ordered array wherein each polynucleotide is attached to a distinct region of the solid support which does not overlap with the attachment site of any other polynucleotide. Preferably, such an ordered array of polynucleotides is designed to be “addressable” where the distinct locations are recorded and can be accessed as part of an assay procedure. Addressable polynucleotide arrays typically comprise a plurality of different oligonucleotide probes that are coupled to a surface of a substrate in different known locations. The knowledge of the precise location of each polynucleotide makes these “addressable” arrays particularly useful in hybridization assays. Any addressable array technology known in the art can be employed with the polynucleotides of the invention. One particular embodiment of these polynucleotide arrays is known as the Genechips™, and has been generally described in U.S. Pat. No. 5,143,854 and PCT publications WO 90/15070 and WO 92/10092. These arrays may generally be produced using mechanical synthesis methods or light directed synthesis methods, which incorporate a combination of photolithographic methods and solid phase oligonucleotide synthesis (Fodor et al., Science, 251:767-777, 1991). The immobilization of arrays of oligonucleotides on solid supports has been rendered possible by the development of a technology generally identified as “Very Large Scale Immobilized Polymer Synthesis” (VLSIPS™) in which, typically, probes are immobilized in a high density array on a solid surface of a chip. Examples of VLSIPS™ technologies are provided in U.S. Pat. Nos. 5,143,854 and 5,412,087 and in PCT Publications WO 90/15070, WO 92/10092 and WO 95/11995, which describe methods for forming oligonucleotide arrays through techniques such as light-directed synthesis techniques. In designing strategies aimed at providing arrays of nucleotides immobilized on solid supports, further presentation strategies were developed to order and display the oligonucleotide arrays on the chips in an attempt to maximize hybridization patterns and sequence information. Examples of such presentation strategies are disclosed in PCT Publications WO 94/12305, WO 94/11530, WO 97/29212 and WO 97/31256.
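The idea of an "addressable" array, in which each probe occupies a recorded, non-overlapping location, can be illustrated with a small data structure. This is a conceptual sketch only: the grid layout, function names and probe labels are hypothetical and do not describe any particular array technology cited above.

```python
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[int, int]  # (row, column) address on the support


def build_addressable_array(probe_names: Iterable[str],
                            n_cols: int) -> Dict[Cell, str]:
    """Assign each probe to its own (row, column) cell, filling row by row."""
    if n_cols < 1:
        raise ValueError("n_cols must be at least 1")
    layout: Dict[Cell, str] = {}
    for index, name in enumerate(probe_names):
        layout[divmod(index, n_cols)] = name
    return layout


def probes_at(layout: Dict[Cell, str],
              positive_cells: Iterable[Cell]) -> List[str]:
    """Translate the addresses that gave a signal back into probe names."""
    return [layout[cell] for cell in positive_cells if cell in layout]


if __name__ == "__main__":
    # Hypothetical probe labels: one probe per allele of two markers.
    layout = build_addressable_array(
        ["marker1_A", "marker1_G", "marker2_C", "marker2_T"], n_cols=2)
    print(probes_at(layout, [(0, 1), (1, 0)]))
```

Because the address-to-probe mapping is recorded when the array is built, a hybridization signal at a given cell can be read directly as the presence of the corresponding allele.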

[0415] Oligonucleotide arrays may comprise at least one of the sequences selected from the group consisting of SEQ ID NOs: 3, 21-38, 57-62, 100-124, 179, 185-338, the preferred hybridization probes for genic PG1-related biallelic markers described above; and the sequences complementary thereto; or a fragment thereof of at least 15 consecutive nucleotides for determining whether a sample contains one or more alleles of the biallelic markers of the present invention. Oligonucleotide arrays may also comprise at least one of the sequences selected from the group consisting of SEQ ID NOs: 179, 339-424; and the sequences complementary thereto or a fragment thereof of at least 15 consecutive nucleotides for amplifying one or more alleles of the PG1-related biallelic markers. In other embodiments, arrays may also comprise at least one of the sequences selected from the group consisting of SEQ ID NOs: 425-578, the preferred microsequencing primers for genic PG1-related biallelic markers described above; and the sequences complementary thereto or a fragment thereof of at least 15 consecutive nucleotides for conducting microsequencing analyses to determine whether a sample contains one or more alleles of a PG1-related biallelic marker.

[0416] The present invention further encompasses polynucleotide sequences that hybridize to any one of SEQ ID NOs: 3, 69, 100-112, or 179-184 under conditions of high or intermediate stringency as described below:

[0417] (i) By way of example and not limitation, procedures using conditions of high stringency are as follows: Prehybridization of filters containing DNA is carried out for 8 h to overnight at 65° C. in buffer composed of 6×SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 0.02% PVP, 0.02% Ficoll, 0.02% BSA, and 500 μg/ml denatured salmon sperm DNA. Filters are hybridized for 48 h at 65° C., the preferred hybridization temperature, in prehybridization mixture containing 100 μg/ml denatured salmon sperm DNA and 5-20×10⁶ cpm of ³²P-labeled probe. Alternatively, the hybridization step can be performed at 65° C. in the presence of SSC buffer, 1×SSC corresponding to 0.15 M NaCl and 0.015 M Na citrate. Subsequently, filter washes can be done at 37° C. for 1 h in a solution containing 2×SSC, 0.01% PVP, 0.01% Ficoll, and 0.01% BSA, followed by a wash in 0.1×SSC at 50° C. for 45 min. Alternatively, filter washes can be performed in a solution containing 2×SSC and 0.1% SDS, or 0.5×SSC and 0.1% SDS, or 0.1×SSC and 0.1% SDS at 68° C. for 15 minute intervals. Following the wash steps, the hybridized probes are detectable by autoradiography. Other conditions of high stringency which may be used are well known in the art and as cited in Sambrook et al., 1989, Molecular Cloning, A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., pp. 9.47-9.57; and Ausubel et al., 1989, Current Protocols in Molecular Biology, Green Publishing Associates and Wiley Interscience, N.Y. Preferably, such sequences encode a homolog of a polypeptide encoded by one of ORF2 to ORF1297. In one embodiment, such sequences encode a mammalian PG1 polypeptide.

[0418] (ii) By way of example and not limitation, procedures using conditions of intermediate stringency are as follows: Filters containing DNA are prehybridized, and then hybridized at a temperature of 60° C. in the presence of a 5×SSC buffer and labeled probe. Subsequently, filter washes are performed in a solution containing 2×SSC at 50° C. and the hybridized probes are detectable by autoradiography. Other conditions of intermediate stringency which may be used are well known in the art and as cited in Sambrook et al., 1989, Molecular Cloning, A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., pp. 9.47-9.57; and Ausubel et al., 1989, Current Protocols in Molecular Biology, Green Publishing Associates and Wiley Interscience, N.Y. Preferably, such sequences encode a homolog of a polypeptide encoded by one of SEQ ID NOs: 3, 69, 100-112, or 179-184. In one embodiment, such sequences encode a mammalian PG1 polypeptide.
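The SSC strengths used in the hybridization and wash steps above (6×, 5×, 2×, 0.5×, 0.1×) are dilutions of a concentrated stock, and the arithmetic can be sketched as follows. The sketch assumes the conventional 20×SSC stock of 3 M NaCl and 0.3 M sodium citrate; that stock composition is an assumption based on common laboratory practice and is not taken from this text, and the function names are hypothetical.

```python
from typing import Tuple

# Assumed composition of the conventional 20x SSC stock (not from the text).
STOCK_FOLD = 20.0
STOCK_NACL_M = 3.0
STOCK_CITRATE_M = 0.3


def ssc_molarity(fold: float) -> Tuple[float, float]:
    """Return (NaCl, sodium citrate) molarity of an n-fold SSC solution."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    scale = fold / STOCK_FOLD
    return STOCK_NACL_M * scale, STOCK_CITRATE_M * scale


def stock_volume_ml(fold: float, final_volume_ml: float) -> float:
    """Volume of 20x stock needed to prepare `final_volume_ml` of n-fold SSC."""
    if fold <= 0 or fold > STOCK_FOLD or final_volume_ml <= 0:
        raise ValueError("fold must be in (0, 20] and the volume positive")
    return final_volume_ml * fold / STOCK_FOLD


if __name__ == "__main__":
    for fold in (6, 5, 2, 0.5, 0.1):
        nacl, citrate = ssc_molarity(fold)
        print(f"{fold}x SSC: {nacl:.4f} M NaCl, {citrate:.4f} M Na citrate")
```

Lower SSC strength means lower salt and therefore a more stringent wash, which is why the high stringency procedure ends at 0.1×SSC while the intermediate procedure washes at 2×SSC.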

[0419] The present invention also encompasses diagnostic kits comprising one or more polynucleotides of the invention with a portion or all of the necessary reagents and instructions for genotyping a test subject by determining the identity of a nucleotide at a PG1-related biallelic marker. The polynucleotides of a kit may optionally be attached to a solid support, or be part of an array or addressable array of polynucleotides. The kit may provide for the determination of the identity of the nucleotide at a marker position by any method known in the art including, but not limited to, a sequencing assay method, a microsequencing assay method, a hybridization assay method, or an allele specific amplification method. Optionally such a kit may include instructions for scoring the results of the determination with respect to the test subject's risk of contracting a cancer or prostate cancer, or likely response to an anti-cancer agent or anti-prostate cancer agent, or chances of suffering from side effects to an anti-cancer agent or anti-prostate cancer agent.

Preferred Genomic Sequences Of The PG-1 Gene

[0420] The present invention concerns the genomic sequence of PG-1. The present invention encompasses the PG-1 gene, or PG-1 genomic sequences consisting of, consisting essentially of, or comprising the sequence of SEQ ID No 179, a sequence complementary thereto, as well as fragments and variants thereof. These polynucleotides may be purified, isolated, or recombinant.

[0421] The invention also encompasses a purified, isolated, or recombinant polynucleotide comprising a nucleotide sequence having at least 70, 75, 80, 85, 90, or 95% nucleotide identity with a nucleotide sequence of SEQ ID No 179 or a complementary sequence thereto or a fragment thereof. The nucleotide differences relative to the nucleotide sequence of SEQ ID No 179 may be generally randomly distributed throughout the entire nucleic acid. Nevertheless, preferred nucleic acids are those wherein the nucleotide differences relative to the nucleotide sequence of SEQ ID No 179 are predominantly located outside the coding sequences contained in the exons. These nucleic acids, as well as their fragments and variants, may be used as oligonucleotide primers or probes in order to detect the presence of a copy of the PG-1 gene in a test sample, or alternatively in order to amplify a target nucleotide sequence within the PG-1 sequences.
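By way of illustration only, the percent nucleotide identity thresholds recited above can be checked with a short calculation. The following is a minimal sketch, assuming the two sequences have already been aligned to equal length by some other tool and that a gap character counts as a difference; the function names and the example sequences are hypothetical and are not taken from the Sequence Listing.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned positions at which two sequences carry the same base.

    Assumption: both sequences are already aligned and of equal length;
    a gap character ('-') in either sequence is counted as a difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(seq_a)


def meets_threshold(seq_a: str, seq_b: str, threshold: float = 70.0) -> bool:
    """True when the identity is at least the given percentage (e.g. 70 to 95)."""
    return percent_identity(seq_a, seq_b) >= threshold


if __name__ == "__main__":
    # Hypothetical 10-base sequences differing at the last two positions.
    reference = "ACGTACGTAC"
    candidate = "ACGTACGTTT"
    print(percent_identity(reference, candidate))
    print(meets_threshold(reference, candidate, 80.0))
```

In practice the identity would be computed over an alignment produced by a dedicated alignment program; the sketch only shows the arithmetic behind the percentage.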

[0422] Another object of the invention consists of a purified, isolated, or recombinant nucleic acid that hybridizes with the nucleotide sequence of SEQ ID No 179 or a complementary sequence thereto or a variant thereof, under the stringent hybridization conditions as defined above.

[0423] Particularly preferred nucleic acids of the invention include isolated, purified, or recombinant polynucleotides comprising a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, or 1000 nucleotides of SEQ ID No 179 or the complements thereof, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, or 25 of the following nucleotide positions of SEQ ID No 179: 1-2324, 2852-2936, 3204-3249, 3456-3572, 3899-4996, 5028-6086, 6310-8710, 9136-11170, 11534-12104, 12733-13163, 13206-14150, 14191-14302, 14338-14359, 14788-15589, 16050-16409, 16440-21718, 21959-22007, 22086-23057, 23488-23712, 23832-24099, 24165-24376, 24429-24568, 24607-25096, 25127-25269, 25300-27576, 27612-29217, 29415-30776, 30807-30986, 31628-32658, 32699-36324, 36772-39149, 39184-40269, 40580-40683, 40844-41048, 41271-43539, 43570-47024, 47510-48065, 48192-49692, 49723-50174, 52626-53599, 54516-55209, and 55666-56146.

Preferred PG-1 cDNA Sequences

[0424] The expression of the PG-1 gene has been shown to lead to the production of at least one mRNA species, the nucleic acid sequence of which is set forth in SEQ ID No 3.

[0425] Another object of the invention is a purified, isolated, or recombinant nucleic acid comprising the nucleotide sequence of SEQ ID No 3, complementary sequences thereto, as well as allelic variants, and fragments thereof. Moreover, preferred polynucleotides of the invention include purified, isolated, or recombinant PG-1 cDNAs consisting of, consisting essentially of, or comprising the sequence of SEQ ID No 3. Particularly preferred nucleic acids of the invention include isolated, purified, or recombinant polynucleotides comprising a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, or 1000 nucleotides of SEQ ID No 3 or the complements thereof, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, or 25 of the following nucleotide positions of SEQ ID No 3: 1-280, 651-690, 3315-4288, and 5176-5227. The invention also pertains to a purified or isolated nucleic acid comprising a polynucleotide having at least 95% nucleotide identity with a polynucleotide of SEQ ID No 3, advantageously 99% nucleotide identity, preferably 99.5% nucleotide identity and most preferably 99.8% nucleotide identity with a polynucleotide of SEQ ID No 3, or a sequence complementary thereto or a biologically active fragment thereof.

Preferred Oligonucleotide Probes And Primers

[0426] Polynucleotides derived from the PG-1 gene are useful in order to detect the presence of at least a copy of a nucleotide sequence of SEQ ID No 179, or a fragment, complement, or variant thereof in a test sample.

[0427] Particularly preferred probes and primers of the invention include isolated, purified, or recombinant polynucleotides comprising a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, or 1000 nucleotides of SEQ ID No 179 or the complements thereof, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, or 25 of the following nucleotide positions of SEQ ID No 179: 1-2324, 2852-2936, 3204-3249, 3456-3572, 3899-4996, 5028-6086, 6310-8710, 9136-11170, 11534-12104, 12733-13163, 13206-14150, 14191-14302, 14338-14359, 14788-15589, 16050-16409, 16440-21718, 21959-22007, 22086-23057, 23488-23712, 23832-24099, 24165-24376, 24429-24568, 24607-25096, 25127-25269, 25300-27576, 27612-29217, 29415-30776, 30807-30986, 31628-32658, 32699-36324, 36772-39149, 39184-40269, 40580-40683, 40844-41048, 41271-43539, 43570-47024, 47510-48065, 48192-49692, 49723-50174, 52626-53599, 54516-55209, and 55666-56146.
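By way of illustration only, the requirement that a contiguous span comprise a minimum number of the listed nucleotide positions can be tested by counting overlaps. The following is a minimal sketch, assuming 1-based inclusive coordinates and non-overlapping ranges; it uses only the first five ranges of the list above, and the function and variable names are hypothetical.

```python
from typing import Iterable, Tuple

Range = Tuple[int, int]  # inclusive, 1-based (start, end)

# First five ranges from the list above (assumption: coordinates are 1-based
# and inclusive); the remaining entries would be appended in the same way.
EXAMPLE_RANGES = [
    (1, 2324),
    (2852, 2936),
    (3204, 3249),
    (3456, 3572),
    (3899, 4996),
]


def positions_in_ranges(span_start: int, span_length: int,
                        ranges: Iterable[Range]) -> int:
    """Count how many positions of a contiguous span fall inside the ranges."""
    span_end = span_start + span_length - 1
    total = 0
    for start, end in ranges:
        overlap = min(span_end, end) - max(span_start, start) + 1
        if overlap > 0:
            total += overlap
    return total


def span_qualifies(span_start: int, span_length: int,
                   ranges: Iterable[Range], minimum: int = 1) -> bool:
    """True when the span includes at least `minimum` of the listed positions."""
    return positions_in_ranges(span_start, span_length, ranges) >= minimum


if __name__ == "__main__":
    # A hypothetical 12-nucleotide span starting at position 2320.
    print(positions_in_ranges(2320, 12, EXAMPLE_RANGES))
    print(span_qualifies(2320, 12, EXAMPLE_RANGES, minimum=10))
```

The same check applies to the positions recited for SEQ ID No 3 by substituting the corresponding ranges.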

[0428] Another object of the invention is a purified, isolated, or recombinant nucleic acid comprising the nucleotide sequence of SEQ ID No 3, complementary sequences thereto, as well as allelic variants, and fragments thereof. Moreover, preferred probes and primers of the invention include purified, isolated, or recombinant PG-1 cDNAs consisting of, consisting essentially of, or comprising the sequence of SEQ ID No 3. Particularly preferred probes and primers of the invention include isolated, purified, or recombinant polynucleotides comprising a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, or 1000 nucleotides of SEQ ID No 3 or the complements thereof, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, or 25 of the following nucleotide positions of SEQ ID No 3: 1-280, 651-690, 3315-4288, and 5176-5227.

Use of PG1 Nucleic Acids as Reagents

[0429] The PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, 112-124 and PG1 alleles responsible for a detectable phenotype (such as those obtainable by the methods of Example 12, and SEQ ID NO: 69) can be used to prepare PCR primers for use in diagnostic techniques or genetic engineering methods such as those described above. Example 10 describes the use of the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, 112-124 and PG1 alleles responsible for a detectable phenotype (such as those obtainable by the methods of Example 12) in PCR amplification procedures.

EXAMPLE 10

[0430] The PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, and PG1 alleles responsible for a detectable phenotype (such as those obtainable by the methods of Example 12) may be used to prepare PCR primers for a variety of applications, including isolation procedures for cloning nucleic acids capable of hybridizing to such sequences, diagnostic techniques and forensic techniques. The PCR primers comprise at least 10 consecutive bases of the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, 112-124 and PG1 alleles responsible for a detectable phenotype (such as those obtainable by the methods of Example 12) or the sequences complementary thereto. Preferably, the PCR primers comprise at least 12, 15, or 17 consecutive bases of these sequences. More preferably, the PCR primers comprise at least 20-30 consecutive bases of the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, 112-124 and PG1 alleles responsible for a detectable phenotype (such as those obtainable by the methods of Example 12) or the sequences complementary thereto. In some embodiments, the PCR primers may comprise more than 30 consecutive bases of the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, 112-124 and PG1 alleles responsible for a detectable phenotype (such as those obtainable by the methods of Example 12) or the sequences complementary thereto. It is preferred that the primer pairs to be used together in a PCR amplification have approximately the same G/C ratio, so that melting temperatures are approximately the same.
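By way of illustration only, the preference for primer pairs of approximately equal G/C ratio can be checked numerically. The following is a minimal sketch that computes the G/C fraction of each primer and estimates a melting temperature with the Wallace rule of thumb (2° C. per A or T, 4° C. per G or C), which is an assumption introduced here for short oligonucleotides and is not a formula taken from this specification. The primer sequences, the function names, and the 2° C. tolerance are hypothetical.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    if not primer:
        raise ValueError("primer must not be empty")
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C by the Wallace rule.

    Assumption: the rule Tm = 2*(A+T) + 4*(G+C) is only a coarse estimate
    and is generally applied to short oligonucleotides.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def primers_matched(forward: str, reverse: str, max_tm_diff: float = 2.0) -> bool:
    """True when the estimated melting temperatures differ by at most max_tm_diff."""
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_tm_diff


if __name__ == "__main__":
    # Hypothetical 16-base primers, not derived from any PG1 sequence.
    forward = "ATGCATGCATGCATGC"
    reverse = "GGATCCATATGCTAGC"
    print(gc_fraction(forward), wallace_tm(forward))
    print(gc_fraction(reverse), wallace_tm(reverse))
    print(primers_matched(forward, reverse))
```

Two primers of the same length and G/C fraction receive the same estimate under this rule, which is the reasoning behind matching G/C ratios within a primer pair.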

[0431] A variety of PCR techniques are familiar to those skilled in the art. For a review of PCR technology, see Molecular Cloning to Genetic Engineering White, B. A. Ed. in Methods in Molecular Biology 67: Humana Press, Totowa 1997. In each of these PCR procedures, PCR primers on either side of the nucleic acid sequences to be amplified are added to a suitably prepared nucleic acid sample along with dNTPs and a thermostable polymerase such as Taq polymerase, Pfu polymerase, or Vent polymerase. The nucleic acid in the sample is denatured and the PCR primers are specifically hybridized to complementary nucleic acid sequences in the sample. The hybridized primers are extended. Thereafter, another cycle of denaturation, hybridization, and extension is initiated. The cycles are repeated multiple times to produce an amplified fragment containing the nucleic acid sequence between the primer sites.
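By way of illustration only, the effect of repeating the denaturation, hybridization, and extension cycle can be sketched as a simple growth calculation. The following assumes an idealized model in which each cycle multiplies the number of target copies by (1 + efficiency), with an efficiency of 1.0 meaning perfect doubling; the model, the function name, and the example numbers are assumptions and do not describe any particular PCR protocol.

```python
def amplified_copies(initial_copies: float, cycles: int,
                     efficiency: float = 1.0) -> float:
    """Expected copy number after a number of PCR cycles.

    Assumption: every cycle multiplies the target by (1 + efficiency),
    where efficiency lies between 0 (no amplification) and 1 (doubling).
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # One starting template, 30 cycles, ideal doubling per cycle.
    print(amplified_copies(1, 30))
    # Ten starting templates, 30 cycles, 90% efficiency per cycle.
    print(amplified_copies(10, 30, efficiency=0.9))
```

Real reactions plateau as reagents are consumed, so the model only applies to the early, exponential cycles.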

[0432] The polynucleotides of the invention also encompass vectors and DNA constructs as well as other forms of primers and probes. For a thorough description of these embodiments please see Sections VIII, X, and XI below.

[0433] III. POLYPEPTIDES

PG1 Proteins and Polypeptide Fragments

[0434] The term “PG1 polypeptides” is used herein to embrace all of the proteins and polypeptides of the present invention. Also forming part of the invention are polypeptides encoded by the polynucleotides of the invention, as well as fusion polypeptides comprising such polypeptides. The invention embodies PG1 proteins from human (SEQ ID NOs: 4 and 5) and mouse (SEQ ID NO: 74). However, PG1 species from other varieties of mammals are expressly contemplated and may be isolated using the antibodies of the present invention in conjunction with standard affinity chromatography methods, or expressed from the PG1 genes isolated from other mammalian sources using human and mouse PG1 nucleic acid sequences as primers and probes as well as the methods described herein.

[0435] The invention also embodies PG1 proteins translated from less common alternative splice species, including SEQ ID NOs: 125-136, and PG1 proteins which result from naturally occurring mutants, particularly functional mutants of PG1, including SEQ ID NO: 70, which may be identified and obtained by the methods described herein. The present invention also embodies polypeptides comprising a contiguous stretch of at least 6 amino acids, preferably at least 8 to 10 amino acids, more preferably at least 12, 15, 20, 25, 50, or 100 amino acids of a PG1 protein. In a preferred embodiment the contiguous stretch of amino acids comprises the site of a mutation or functional mutation, including a deletion, addition, swap or truncation of the amino acids in the PG1 protein sequence. For instance, polypeptides that contain either the Arg or His residue at amino acid position 184, and polypeptides that contain either the Arg or Ile residue at amino acid position 293 of SEQ ID NO: 4 in said contiguous stretch are particularly preferred embodiments of the invention and are useful in the manufacture of antibodies to detect the presence and absence of these mutations. Similarly, polypeptides with a carboxy terminus at position 228 are particularly preferred embodiments of the invention and are useful in the manufacture of antibodies to detect the presence and absence of the mutation shown in SEQ ID NOs: 69 and 70.

[0436] Similarly, polypeptides that contain a peptide sequence of 8, 10, 12, 15, or 25 amino acids encoded over a naturally-occurring splice junction (the point at which two human PG1 exons (SEQ ID NOs: 100-111) are covalently linked) in said contiguous stretch are particularly preferred embodiments and are useful in the manufacture of antibodies to detect the presence, localization, and quantity of the various protein products of the PG1 alternative splice species.

[0437] PG1 proteins are preferably isolated from human, mouse or mammalian tissue samples or expressed from human, mouse or mammalian genes.

[0438] The PG1 polypeptides of the invention can be made using routine expression methods known in the art, see, for instance, Example 11, below. The polynucleotide encoding the desired polypeptide is ligated into an expression vector suitable for any convenient host. Both eukaryotic and prokaryotic host systems may be used in forming recombinant polypeptides, and a summary of some of the more common systems is included in Sections II and VIII. The polypeptide is then isolated from lysed cells or from the culture medium and purified to the extent needed for its intended use. Purification may be by any technique known in the art, for example, differential extraction, salt fractionation, chromatography, centrifugation, and the like. See, for example, Methods in Enzymology for a variety of methods for purifying proteins.

[0439] In addition, shorter protein fragments may be produced by chemical synthesis. Alternatively, the proteins of the invention may be extracted from cells or tissues of humans or non-human animals. Methods for purifying proteins are known in the art, and include the use of detergents or chaotropic agents to disrupt particles followed by differential extraction and separation of the polypeptides by ion exchange chromatography, affinity chromatography, sedimentation according to density, and gel electrophoresis.

Preferred PG-1 Proteins and Polypeptide Fragments

[0440] The invention embodies PG-1 proteins from humans, including isolated or purified PG-1 proteins consisting of, consisting essentially of, or comprising the sequence of SEQ ID No 4.

[0441] The present invention also embodies isolated, purified, and recombinant polypeptides comprising a contiguous span of at least 6 amino acids, preferably at least 8 to 10 amino acids, more preferably at least 12, 15, 20, 25, 30, 40, 50, or 100 amino acids of SEQ ID No 4, wherein said contiguous span includes at least 1, 2, 3, or 5 of the amino acid positions 1-26, 295-302, and 333-353. In other preferred embodiments the contiguous stretch of amino acids comprises the site of a mutation or functional mutation, including a deletion, addition, swap or truncation of the amino acids in the PG-1 protein sequence.

Expression of the PG1 Protein

[0442] Any PG1 cDNA, including SEQ ID NO: 3, 69, 112-124, or 184, or synthetic DNAs may be used as described in Example 11 below to express PG1 proteins and polypeptides.

EXAMPLE 11

[0443] The nucleic acid encoding the PG1 protein or polypeptide to be expressed is operably linked to a promoter in an expression vector using conventional cloning technology. The PG1 insert in the expression vector may comprise the full coding sequence for the PG1 protein or a portion thereof. For example, the PG1 derived insert may encode a polypeptide comprising at least 10 consecutive amino acids of the PG1 protein of SEQ ID NO: 4.

[0444] The expression vector may be any of the mammalian, yeast, insect or bacterial expression systems known in the art, see for example Section VIII. Commercially available vectors and expression systems are available from a variety of suppliers including Genetics Institute (Cambridge, Mass.), Stratagene (La Jolla, Calif.), Promega (Madison, Wis.), and Invitrogen (San Diego, Calif.). If desired, to enhance expression and facilitate proper protein folding, the codon context and codon pairing of the sequence may be optimized for the particular expression organism into which the expression vector is introduced, as explained by Hatfield, et al., U.S. Pat. No. 5,082,767.

[0445] The following is provided as one exemplary method to express the PG1 protein or a portion thereof. In one embodiment, the entire coding sequence of the PG1 cDNA through the poly A signal of the cDNA is operably linked to a promoter in the expression vector. Alternatively, if the nucleic acid encoding a portion of the PG1 protein lacks a methionine to serve as the initiation site, an initiating methionine can be introduced next to the first codon of the nucleic acid using conventional techniques. Similarly, if the insert from the PG1 cDNA lacks a poly A signal, this sequence can be added to the construct by, for example, splicing out the poly A signal from pSG5 (Stratagene) using BglI and SalI restriction endonuclease enzymes and incorporating it into the mammalian expression vector pXT1 (Stratagene). pXT1 contains the LTRs and a portion of the gag gene from Moloney Murine Leukemia Virus. The position of the LTRs in the construct allows efficient stable transfection. The vector includes the Herpes Simplex Thymidine Kinase promoter and the selectable neomycin gene. The nucleic acid encoding the PG1 protein or a portion thereof is obtained by PCR from a bacterial vector containing the PG1 cDNA of SEQ ID NO: 3 using oligonucleotide primers complementary to the PG1 cDNA or portion thereof and containing restriction endonuclease sequences for PstI incorporated into the 5′ primer and BglII at the 5′ end of the corresponding cDNA 3′ primer, taking care to ensure that the sequence encoding the PG1 protein or a portion thereof is positioned properly with respect to the poly A signal. The purified fragment obtained from the resulting PCR reaction is digested with PstI, blunt ended with an exonuclease, digested with BglII, purified and ligated to pXT1, now containing a poly A signal and digested with BglII.

[0446] The ligated product is transfected into mouse NIH 3T3 cells using Lipofectin (Life Technologies, Inc., Grand Island, N.Y.) under conditions outlined in the product specification. Positive transfectants are selected after growing the transfected cells in 600 μg/ml G418 (Sigma, St. Louis, Mo.).

[0447] Alternatively, the nucleic acids encoding the PG1 protein or a portion thereof may be cloned into pED6dpc2 (Genetics Institute, Cambridge, Mass.). The resulting pED6dpc2 constructs may be transfected into a suitable host cell, such as COS 1 cells. Methotrexate resistant cells are selected and expanded.

[0448] The above procedures may also be used to express a mutant PG1protein responsible for a detectable phenotype or a portion thereof.

[0449] The expressed proteins may be purified using conventional purification techniques such as ammonium sulfate precipitation or chromatographic separation based on size or charge. The protein encoded by the nucleic acid insert may also be purified using standard immunochromatography techniques. In such procedures, a solution containing the expressed PG1 protein or portion thereof, such as a cell extract, is applied to a column having antibodies against the PG1 protein or portion thereof attached to the chromatography matrix. The expressed protein is allowed to bind the immunochromatography column. Thereafter, the column is washed to remove non-specifically bound proteins. The specifically bound expressed protein is then released from the column and recovered using standard techniques.

[0450] To confirm expression of the PG1 protein or a portion thereof, the proteins expressed from host cells containing an expression vector containing an insert encoding the PG1 protein or a portion thereof can be compared to the proteins expressed in host cells containing the expression vector without an insert. The presence of a band in samples from cells containing the expression vector with an insert which is absent in samples from cells containing the expression vector without an insert indicates that the PG1 protein or a portion thereof is being expressed. Generally, the band will have the mobility expected for the PG1 protein or portion thereof. However, the band may have a mobility different than that expected as a result of modifications such as glycosylation, ubiquitination, or enzymatic cleavage.

[0451] Antibodies capable of specifically recognizing the expressed PG1 protein or a portion thereof may be generated as described below in Section VII.

[0452] If antibody production is not possible, the nucleic acids encoding the PG1 protein or a portion thereof may be incorporated into expression vectors designed for use in purification schemes employing chimeric polypeptides. In such strategies the nucleic acid encoding the PG1 protein or a portion thereof is inserted in frame with the gene encoding the other half of the chimera. The other half of the chimera may be β-globin or a nickel binding polypeptide encoding sequence. A chromatography matrix having antibody to β-globin or nickel attached thereto is then used to purify the chimeric protein. Protease cleavage sites may be engineered between the β-globin gene or the nickel binding polypeptide and the PG1 protein or portion thereof. Thus, the two polypeptides of the chimera may be separated from one another by protease digestion. One useful expression vector for generating β-globin chimerics is pSG5 (Stratagene), which encodes rabbit β-globin. Intron II of the rabbit β-globin gene facilitates splicing of the expressed transcript, and the polyadenylation signal incorporated into the construct increases the level of expression. These techniques are well known to those skilled in the art of molecular biology. Standard methods are published in methods texts such as Davis et al., (Basic Methods in Molecular Biology, L. G. Davis, M. D. Dibner, and J. F. Battey, ed., Elsevier Press, NY, 1986) and many of the methods are available from Stratagene, Life Technologies, Inc., or Promega. Polypeptide may additionally be produced from the construct using in vitro translation systems such as the In vitro Express™ Translation Kit (Stratagene).

[0453] IV. IDENTIFICATION OF MUTATIONS IN THE PG1 GENE WHICH ARE ASSOCIATED WITH A DETECTABLE PHENOTYPE

[0454] Mutations in the PG1 gene which are responsible for a detectable phenotype may be identified by comparing the sequences of the PG1 genes from affected and unaffected individuals as described in Example 12, below. The detectable phenotype may comprise a variety of manifestations of altered PG1 function, including prostate cancer, hepatocellular carcinoma, colorectal cancer, non-small cell lung cancer, squamous cell carcinoma, or other conditions. The mutations may comprise point mutations, deletions, or insertions of the PG1 gene. The mutations may lie within the coding sequence for the PG1 protein or within regulatory regions in the PG1 gene.

EXAMPLE 12

[0455] Oligonucleotide primers are designed to amplify the sequences of each of the exons or the promoter region of the PG1 gene. The oligonucleotide primers may comprise at least 10 consecutive nucleotides of the PG1 genomic DNA of SEQ ID NO: 179 or the PG1 cDNA of SEQ ID NO: 3 or the sequences complementary thereto. Preferably, the oligonucleotides comprise at least 15 consecutive nucleotides of the PG1 genomic DNA of SEQ ID NO: 179 or the PG1 cDNA of SEQ ID NO: 3 or the sequences complementary thereto. In some embodiments, the oligonucleotides may comprise at least 20 consecutive nucleotides of the PG1 genomic DNA of SEQ ID NO: 179 or the PG1 cDNA of SEQ ID NO: 3 or the sequences complementary thereto. In other embodiments, the oligonucleotides may comprise 25 or more consecutive nucleotides of the PG1 genomic DNA of SEQ ID NO: 179 or the PG1 cDNA of SEQ ID NO: 3 or the sequences complementary thereto.

[0456] Each primer pair is used to amplify the exon or promoter region from which it is derived. Amplification is carried out on genomic DNA samples from affected patients and unaffected controls using the PCR conditions described above. Amplification products from the genomic PCRs are subjected to automated dideoxy terminator sequencing reactions and electrophoresed on ABI 377 sequencers. Following gel image analysis and DNA sequence extraction, ABI sequence data are automatically analyzed to detect the presence of sequence variations among affected and unaffected individuals. Sequences are verified by determining the sequences of both DNA strands for each individual. Preferably, these candidate mutations are detected by comparing individuals homozygous for haplotype 5 of FIG. 4 and controls not carrying haplotype 5 or related haplotypes.

[0457] Candidate polymorphisms suspected of being responsible for the detectable phenotype, such as prostate cancer or other conditions, are then verified by screening a larger population of affected and unaffected individuals using the microsequencing technique described above. Polymorphisms which exhibit a statistically significant correlation with the detectable phenotype are deemed responsible for the detectable phenotype.

[0458] Other techniques may also be used to detect polymorphisms associated with a detectable phenotype such as prostate cancer or other conditions. For example, polymorphisms may be detected using single stranded conformation analyses such as those described in Orita et al., Proc. Natl. Acad. Sci. U.S.A. 86: 2766-2770 (1989). In this approach, polymorphisms are detected through altered migration on SSCA gels.

[0459] Alternatively, polymorphisms may be identified using clamped denaturing gel electrophoresis, heteroduplex analysis, chemical mismatch cleavage, and other conventional techniques as described in Sheffield, V. C. et al., Proc. Natl. Acad. Sci. U.S.A 49:699-706 (1991); White, M. B. et al., Genomics 12:301-306 (1992); Grompe, M. et al., Proc. Natl. Acad. Sci. U.S.A 86:5855-5892 (1989); and Grompe, M. Nature Genetics 5:111-117 (1993).

[0460] The PG1 genes from individuals carrying PG1 mutations responsible for the detectable phenotype, or cDNAs derived therefrom, may be cloned as follows. Nucleic acid samples are obtained from individuals having a PG1 mutation associated with the detectable phenotype. The nucleic acid samples are contacted with a probe derived from the PG1 genomic DNA of SEQ ID NO: 179 or the PG1 cDNA of SEQ ID NO: 3. Nucleic acids containing the mutant PG1 allele are identified using conventional techniques. For example, the mutant PG1 gene, or a cDNA derived therefrom, may be obtained by conducting an amplification reaction using primers derived from the PG1 genomic DNA of SEQ ID NO: 179 or the PG1 cDNA of SEQ ID NO: 3. Alternatively, the mutant PG1 gene, or a cDNA derived therefrom, may be identified by hybridizing a genomic library or a cDNA library obtained from an individual having a mutant PG1 gene with a detectable probe derived from the PG1 genomic DNA of SEQ ID NO: 179 or the PG1 cDNA of SEQ ID NO: 3. Alternatively, the mutant PG1 allele may be obtained by contacting an expression library from an individual carrying a PG1 mutation with a detectable antibody against the PG1 proteins of SEQ ID NO: 4 or SEQ ID NO: 5 which has been prepared as described below. Those skilled in the art will appreciate that the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3 and the PG1 proteins of SEQ ID NOs: 4 and 5 may be used in a variety of other conventional techniques to obtain the mutant PG1 gene.

[0461] In another embodiment the mutant PG1 allele which causes a detectable phenotype can be isolated by obtaining a nucleic acid sample such as a genomic library or a cDNA library from an individual expressing the detectable phenotype. The nucleic acid sample can be contacted with one or more probes lying in the 8p23 region of the human genome. Nucleic acids in the sample which contain the PG1 gene can be identified by conducting sequencing reactions on the nucleic acids which hybridize to the markers from the 8p23 region of the human genome.

[0462] The region of the PG1 gene containing the mutation responsible for the detectable phenotype may also be used in diagnostic techniques such as those described below. For example, oligonucleotides containing the mutation responsible for the detectable phenotype may be used in amplification or hybridization based diagnostics, such as those described herein, for detecting individuals suffering from the detectable phenotype or individuals at risk of developing the detectable phenotype at a subsequent time. In addition, the PG1 allele responsible for the detectable phenotype may be used in gene therapy as described herein. The PG1 allele responsible for the detectable phenotype may also be cloned into an expression vector to express the mutant PG1 protein as described herein.

[0463] During the search for biallelic markers associated with prostate cancer, a number of polymorphic bases were discovered which lie within the PG1 gene. The identities and positions of these polymorphic bases are listed as features in the accompanying Sequence Listing for the PG1 genomic DNA of SEQ ID NO: 179. The polymorphic bases may be used in the above-described diagnostic techniques for determining whether an individual is at risk for developing prostate cancer at a subsequent date or suffers from prostate cancer as a result of a PG1 mutation. The identities of the nucleotides present at the polymorphic positions in a nucleic acid sample may be determined using techniques, such as microsequencing analysis, which are described above.

[0464] It is possible that one or more of these polymorphisms (or other polymorphic bases) are mutations which are associated with prostate cancer. To determine whether a polymorphism is responsible for prostate cancer, the frequency of each of the alleles in individuals suffering from prostate cancer and in unaffected individuals is measured as described in the haplotype analysis above. Those mutations which occur at a statistically significant frequency in the affected population are deemed to be responsible for prostate cancer.
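By way of illustration only, the comparison of allele frequencies between affected and unaffected individuals can be sketched as a test on a 2×2 table of allele counts. The following is a minimal sketch using a Pearson chi-square test with one degree of freedom and no continuity correction, which is a simple stand-in chosen here and not necessarily the statistical method of the haplotype analysis referred to above. The allele counts and function name are hypothetical.

```python
import math
from typing import Tuple


def allele_chi_square(case_a: int, case_b: int,
                      control_a: int, control_b: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and p-value for a 2x2 allele count table.

    Rows are affected (case) and unaffected (control) individuals; columns
    are the counts of the two alleles of a biallelic marker. Assumption:
    no continuity correction, one degree of freedom.
    """
    n = case_a + case_b + control_a + control_b
    row_case = case_a + case_b
    row_control = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    denominator = row_case * row_control * col_a * col_b
    if denominator == 0:
        raise ValueError("every row and column must contain at least one count")
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / denominator
    # For one degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: 60/40 alleles in cases versus 40/60 in controls.
    chi2, p_value = allele_chi_square(60, 40, 40, 60)
    print(chi2, p_value)
```

With small counts an exact test would be more appropriate, and testing many markers at once calls for a correction for multiple comparisons.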

[0465] cDNAs containing the identified mutant PG1 gene may be prepared as described above and cloned into expression vectors as described below. The proteins expressed from the expression vectors may be used to generate antibodies specific for the mutant PG1 proteins as described below. In addition, allele specific probes containing the PG1 mutation responsible for prostate cancer may be used in the diagnostic techniques described below.

[0466] Genes sharing homology to the PG1 gene may be identified as follows.

EXAMPLE 13

[0467] Alternatively, a cDNA library or genomic DNA library to be screened for genes sharing homology to the PG1 gene may be obtained from a commercial source or made using techniques familiar to those skilled in the art. The cDNA library or genomic DNA library is hybridized to a detectable probe comprising at least 10 consecutive nucleotides from the PG1 cDNA of SEQ ID NO: 3, the PG1 genomic DNA of SEQ ID NO: 179, or the sequences complementary thereto, using conventional techniques. Preferably, the probe comprises at least 12, 15, or 17 consecutive nucleotides from the PG1 cDNA of SEQ ID NO: 3, the PG1 genomic DNA of SEQ ID NO: 179, or the sequences complementary thereto. More preferably, the probe comprises at least 20-30 consecutive nucleotides from the PG1 cDNA of SEQ ID NO: 3, the PG1 genomic DNA of SEQ ID NO: 179, or the sequences complementary thereto. In some embodiments, the probe comprises more than 30 nucleotides from the PG1 cDNA of SEQ ID NO: 3, the PG1 genomic DNA of SEQ ID NO: 179, or the sequences complementary thereto.

[0468] Techniques for identifying cDNA clones in a cDNA library which hybridize to a given probe sequence are disclosed in Sambrook et al., Molecular Cloning: A Laboratory Manual 2d Ed., Cold Spring Harbor Laboratory Press, 1989. The same techniques may be used to isolate genomic DNAs sharing homology with the PG1 gene.

[0469] Briefly, cDNA or genomic DNA clones which hybridize to the detectable probe are identified and isolated for further manipulation as follows. A probe comprising at least 10 consecutive nucleotides from the PG1 cDNA of SEQ ID NO: 3, the PG1 genomic DNA of SEQ ID NO: 179, or the sequences complementary thereto, is labeled with a detectable label such as a radioisotope or a fluorescent molecule. Preferably, the probe comprises at least 12, 15, or 17 consecutive nucleotides from the PG1 cDNA of SEQ ID NO: 3, the PG1 genomic DNA of SEQ ID NO: 179, or the sequences complementary thereto. More preferably, the probe comprises 20-30 consecutive nucleotides from the PG1 cDNA of SEQ ID NO: 3, the PG1 genomic DNA of SEQ ID NO: 179, or the sequences complementary thereto. In some embodiments, the probe comprises more than 30 nucleotides from the PG1 cDNA of SEQ ID NO: 3, the PG1 genomic DNA of SEQ ID NO: 179, or the sequences complementary thereto.
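The probe definition above (a run of consecutive nucleotides taken from a sequence or from its complement) can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the input is any DNA string supplied by the user rather than the sequences of the Sequence Listing, and the lower bound of 10 nucleotides simply mirrors the minimum length recited in the text.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))

def candidate_probes(seq, length, include_complement=True):
    """List every window of `length` consecutive nucleotides in `seq`,
    optionally followed by the windows of its reverse complement.
    Hypothetical helper written for illustration."""
    if length < 10:
        raise ValueError("the text recites at least 10 consecutive nucleotides")
    strands = [seq.upper()]
    if include_complement:
        strands.append(reverse_complement(seq))
    # A strand shorter than `length` contributes no windows.
    return [s[i:i + length] for s in strands
            for i in range(len(s) - length + 1)]
```

Selecting among the resulting windows (for example by G+C content or uniqueness) is left open here, since the specification does not state selection criteria at this point.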

[0470] Techniques for labeling the probe are well known and include phosphorylation with polynucleotide kinase, nick translation, in vitro transcription, and non-radioactive techniques. The cDNAs or genomic DNAs in the library are transferred to a nitrocellulose or nylon filter and denatured. After incubation of the filter with a blocking solution, the filter is contacted with the labeled probe and incubated for a sufficient amount of time for the probe to hybridize to cDNAs or genomic DNAs containing a sequence capable of hybridizing to the probe.

[0471] By varying the stringency of the hybridization conditions used to identify cDNAs or genomic DNAs which hybridize to the detectable probe, cDNAs or genomic DNAs having different levels of homology to the probe can be identified and isolated. To identify cDNAs or genomic DNAs having a high degree of homology to the probe sequence, the melting temperature of the probe is calculated using the following formulas:

[0472] For probes between 14 and 70 nucleotides in length, the melting temperature (Tm) is calculated using the formula: Tm = 81.5 + 16.6(log(Na+)) + 0.41(fraction G+C) - (600/N), where N is the length of the probe.

[0473] If the hybridization is carried out in a solution containing formamide, the melting temperature is calculated using the equation Tm = 81.5 + 16.6(log(Na+)) + 0.41(fraction G+C) - (0.63 × % formamide) - (600/N), where N is the length of the probe.
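The two melting temperature formulas can be combined in one short Python function, offered as a sketch under stated assumptions: the logarithm is taken as base 10, the Na+ concentration is in moles per liter, and the G+C term is entered as a percentage (0 to 100), as is conventional for this empirical formula, although the text says "fraction G+C". The function name and parameter names are hypothetical, and applying the 14 to 70 nucleotide range to the formamide variant is likewise an assumption.

```python
import math

def melting_temperature(na_molar, gc_percent, length, formamide_percent=0.0):
    """Estimate probe Tm in degrees C from the two formulas in the text.

    Assumptions (not stated in the source): log is base 10, na_molar is
    mol/L, gc_percent and formamide_percent are percentages (0-100).
    """
    if not 14 <= length <= 70:
        raise ValueError("formula is given for probes of 14 to 70 nucleotides")
    if na_molar <= 0:
        raise ValueError("Na+ concentration must be positive")
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 0.41 * gc_percent
            - 0.63 * formamide_percent   # zero when no formamide is present
            - 600.0 / length)
```

With `formamide_percent` left at its default of zero the function reduces to the first formula, so a single implementation covers both cases.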

[0474] Prehybridization is carried out in 6×SSC, 5×Denhardt's reagent, 0.5% SDS, 100 μg denatured fragmented salmon sperm DNA or 6×SSC, 5×Denhardt's reagent, 0.5% SDS, 100 μg denatured fragmented salmon sperm DNA, 50% formamide. The formulas for SSC and Denhardt's solutions are listed in Sambrook et al., supra.

[0475] Hybridization is conducted by adding the detectable probe to the prehybridization solutions listed above. Where the probe comprises double stranded DNA, it is denatured before addition to the hybridization solution. The filter is contacted with the hybridization solution for a sufficient period of time to allow the probe to hybridize to cDNAs or genomic DNAs containing sequences complementary thereto or homologous thereto. For probes over 200 nucleotides in length, the hybridization is carried out at 15-25° C. below the Tm. For shorter probes, such as oligonucleotide probes, the hybridization is conducted at 5-10° C. below the Tm. Preferably, for hybridizations in 6×SSC, the hybridization is conducted at approximately 68° C. Preferably, for hybridizations in 50% formamide containing solutions, the hybridization is conducted at approximately 42° C.

[0476] All of the foregoing hybridizations would be considered to be under “stringent” conditions.

[0477] Following hybridization, the filter is washed in 2×SSC, 0.1% SDS at room temperature for 15 minutes. The filter is then washed with 0.1×SSC, 0.5% SDS at room temperature for 30 minutes to 1 hour. Thereafter, the filter is washed at the hybridization temperature in 0.1×SSC, 0.5% SDS. A final wash is conducted in 0.1×SSC at room temperature. cDNAs or genomic DNAs homologous to the PG1 gene which have hybridized to the probe are identified by autoradiography or other conventional techniques.

[0478] The above procedure is modified to identify cDNAs or genomic DNAs having decreasing levels of homology to the probe sequence. For example, to obtain cDNAs or genomic DNAs of decreasing homology to the detectable probe, less stringent conditions may be used. For example, the hybridization temperature is decreased in increments of 5° C. from 68° C. to 42° C. in a hybridization buffer having a Na+ concentration of approximately 1M. Following hybridization, the filter is washed with 2×SSC, 0.5% SDS at the temperature of hybridization. These conditions are considered to be “moderate” conditions above 50° C. and “low” conditions below 50° C.

[0479] Alternatively, the hybridization is carried out in buffers, such as 6×SSC, containing formamide at a temperature of 42° C. In this case, the concentration of formamide in the hybridization buffer is reduced in 5% increments from 50% to 0% to identify clones having decreasing levels of homology to the probe. Following hybridization, the filter is washed with 6×SSC, 0.5% SDS at 50° C. These conditions are considered to be “moderate” conditions above 25% formamide and “low” conditions below 25% formamide.
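The formamide series described above can be laid out programmatically. The following sketch, with a hypothetical function name, lists the formamide concentrations in 5% steps from 50% to 0% and attaches the stringency label given in the text. The text does not classify exactly 25% formamide, so that step is labeled "unspecified" here as an explicit assumption.

```python
def formamide_schedule(start=50, stop=0, step=5):
    """Return (percent_formamide, label) pairs for hybridization at 42 C.

    Labels follow the text: "moderate" above 25% formamide and "low"
    below 25%. Exactly 25% is not classified in the source, so it is
    marked "unspecified" (an assumption of this sketch).
    """
    schedule = []
    for pct in range(start, stop - 1, -step):
        if pct > 25:
            label = "moderate"
        elif pct < 25:
            label = "low"
        else:
            label = "unspecified"
        schedule.append((pct, label))
    return schedule
```

Calling `formamide_schedule()` with the defaults yields eleven steps, one for each concentration from 50% down to 0%.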

[0480] cDNAs or genomic DNAs which have hybridized to the probe are identified by autoradiography.

[0481] If it is desired to obtain nucleic acids homologous to the PG1 gene, such as allelic variants thereof or nucleic acids encoding proteins related to the PG1 protein, the level of homology between the hybridized nucleic acid and the PG1 gene may readily be determined. To determine the level of homology between the hybridized nucleic acid and the PG1 gene, the nucleotide sequences of the hybridized nucleic acid and the PG1 gene are compared. For example, using the above methods, nucleic acids having at least 95% nucleic acid homology to the PG1 gene may be obtained and identified. Similarly, by using progressively less stringent hybridization conditions one can obtain and identify nucleic acids having at least 90%, at least 85%, at least 80% or at least 75% homology to the PG1 gene.
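The sequence comparison described above can be illustrated with a simple percent identity calculation over two sequences that have already been aligned. This is a sketch only: the specification does not name an alignment method or a scoring scheme, so the use of pre-aligned input with "-" as the gap character, the treatment of gaps as mismatches, and both function names are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of compared alignment columns in which both sequences agree.

    Inputs must be pre-aligned strings of equal length; "-" marks a gap.
    Columns that are gaps in both sequences are ignored; a gap opposite a
    nucleotide counts as a mismatch (assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared

def homology_bracket(identity, thresholds=(95, 90, 85, 80, 75)):
    """Return the highest recited threshold met by `identity`, else None."""
    for threshold in thresholds:
        if identity >= threshold:
            return threshold
    return None
```

The default thresholds mirror the percentages recited in the paragraph above; any other set can be passed in.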

[0482] To determine whether a clone encodes a protein having a given amount of homology to the PG1 protein, the amino acid sequence of the PG1 protein is compared to the amino acid sequence encoded by the hybridizing nucleic acid. Homology is determined to exist when an amino acid sequence in the PG1 protein is closely related to an amino acid sequence in the hybridizing nucleic acid. A sequence is closely related when it is identical to that of the PG1 sequence or when it contains one or more amino acid substitutions therein in which amino acids having similar characteristics have been substituted for one another. Using the above methods, one can obtain nucleic acids encoding proteins having at least 95%, at least 90%, at least 85%, at least 80% or at least 75% homology to the proteins encoded by the PG1 probe.

Isolation and Use of Mutant or Low Frequency PG1 Alleles from Mammalian Prostate Tumor Tissues and Cell lines

[0483] A single mutant PG1 gene was isolated from a human prostate cancer cell line. The nucleic acid sequence and amino acid sequence of this mutant PG1 are disclosed in SEQ ID NOs: 69 and 70, respectively. This mutant was found to contain a stop codon at codon position number 229, and therefore results in a truncated gene product of only 228 amino acids. The present invention encompasses purified or isolated nucleic acids comprising at least 8, 10, 12, 15, 20, or 25 consecutive nucleotides of SEQ ID NO: 69, preferably containing the mutation in codon number 229. A preferred embodiment of the present invention encompasses purified or isolated nucleic acids comprising at least 8, 10, 12, 15, 20, or 25 consecutive nucleotides of SEQ ID NO: 71.
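The relationship between the position of a premature stop codon and the length of the truncated product (a stop at codon 229 leaving 228 amino acids) can be checked with a short Python sketch. The function names are hypothetical, the standard stop codons TAA, TAG and TGA are assumed, and any sequence passed in is the user's own; the sketch does not reproduce SEQ ID NO: 69.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)

def first_stop_codon(cds):
    """Return the 1-based index of the first in-frame stop codon, or None.

    `cds` is a coding sequence read from its first codon; RNA input is
    accepted by converting U to T.
    """
    cds = cds.upper().replace("U", "T")
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3 + 1
    return None

def truncated_length(cds):
    """Number of amino acids encoded before the first in-frame stop codon."""
    stop = first_stop_codon(cds)
    return None if stop is None else stop - 1
```

For a toy sequence built as `"ATG" + "GCT" * 227 + "TAA"`, the stop falls at codon 229 and the encoded product is 228 residues long, matching the arithmetic stated above.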

[0484] The present invention is also directed to methods of determining whether an individual is at risk of developing prostate cancer at a later date or whether said individual suffers from prostate cancer as a result of a mutation in the PG1 gene comprising: obtaining a nucleic acid sample from said individual; and determining whether the nucleotides present at one or more of the polymorphic bases in the sequences selected from the group consisting of SEQ ID NOs: 69 and 71 are indicative of a risk of developing prostate cancer at a later date or indicative of prostate cancer resulting from a mutation in the PG1 gene. The present invention also includes purified or isolated nucleic acids encoding at least 4, 8, 10, 12, 15, or 20 consecutive amino acids of the polypeptide of SEQ ID NO: 70, preferably including the carboxy terminus of said polypeptide. The isolated or purified polypeptides of the invention include polypeptides comprising at least 4, 8, 10, 12, 15, or 20 consecutive amino acids of the polypeptide of SEQ ID NO: 70, preferably including the carboxy terminus of said polypeptide.

[0485] V. DIAGNOSIS OF INDIVIDUALS AT RISK FOR DEVELOPING PROSTATE CANCER OR INDIVIDUALS SUFFERING FROM PROSTATE CANCER AS A RESULT OF A MUTATION IN THE PG1 GENE

[0486] Individuals may then be screened for the presence of polymorphisms in the PG1 gene or protein which are associated with a detectable phenotype such as cancer, prostate cancer or other conditions as described in Example 14, below. The individuals may be screened while they are asymptomatic to determine their risk of developing cancer, prostate cancer or other conditions at a subsequent time. Alternatively, individuals suffering from cancer, prostate cancer or other conditions may be screened for the presence of polymorphisms in the PG1 gene or protein in order to determine whether therapies which target the PG1 gene or protein should be applied.

EXAMPLE 14

[0487] Nucleic acid samples are obtained from a symptomatic or asymptomatic individual. The nucleic acid samples may be obtained from blood cells as described above or from other tissues or organs. For individuals suffering from prostate cancer, the nucleic acid sample may be obtained from the tumor. The nucleic acid sample may comprise DNA, RNA, or both. The nucleotides at positions in the PG1 gene where mutations lead to prostate cancer or other detectable phenotypes are determined for the nucleic acid sample.

[0488] In one embodiment, a PCR amplification is conducted on the nucleic acid sample as described above to amplify regions in which polymorphisms associated with prostate cancer or other detectable phenotypes have been identified. The amplification products are sequenced to determine whether the individual possesses one or more PG1 polymorphisms associated with prostate cancer or other detectable phenotypes.

[0489] Alternatively, the nucleic acid sample is subjected to microsequencing reactions as described above to determine whether the individual possesses one or more PG1 polymorphisms associated with prostate cancer or another detectable phenotype resulting from a mutation in the PG1 gene.

[0490] In another embodiment, the nucleic acid sample is contacted with one or more allele specific oligonucleotides which specifically hybridize to one or more PG1 alleles associated with prostate cancer or another detectable phenotype. The nucleic acid sample is also contacted with a second PG1 oligonucleotide capable of producing an amplification product when used with the allele specific oligonucleotide in an amplification reaction. The presence of an amplification product in the amplification reaction indicates that the individual possesses one or more PG1 alleles associated with prostate cancer or another detectable phenotype.

Determination of PG1 Expression Levels

[0491] As discussed above, PG1 polymorphisms associated with cancer, prostate cancer or other detectable phenotypes may exert their effects by increasing, decreasing, or eliminating PG1 expression, or by altering the frequency of various transcription species. Accordingly, PG1 expression levels in individuals suffering from cancer, prostate cancer or other detectable phenotypes may be compared to those of unaffected individuals to determine whether over-expression, under-expression, loss of expression, or changes in the relative frequency of transcription species of PG1 cause cancer, prostate cancer or another detectable phenotype. Individuals may be tested to determine whether they are at risk of developing cancer or prostate cancer at a subsequent time, or whether they suffer from prostate cancer resulting from a mutation in the PG1 gene, by determining whether they exhibit a level of PG1 expression associated with prostate cancer. Similarly, individuals may be tested to determine whether they suffer from another PG1 mediated detectable phenotype or whether they are at risk of suffering from such a condition at a subsequent time.

[0492] Expression levels in nucleic acid samples from affected and unaffected individuals may be determined by performing Northern blots using detectable probes derived from the PG1 gene or the PG1 cDNA. A variety of conventional Northern blotting procedures may be used to detect and quantitate PG1 expression and the frequencies of the various transcription species of PG1, including those disclosed in Current Protocols in Molecular Biology, John Wiley & Sons, Inc. 1997 and Sambrook et al. Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor Laboratory Press, 1989.

[0493] Alternatively, PG1 expression levels may be determined as described in Example 15, below.

EXAMPLE 15

[0494] Expression levels and patterns of PG1 may be analyzed by solution hybridization with long probes as described in International Patent Application No. WO 97/05277. Briefly, the PG1 cDNA or the PG1 genomic DNA described above, or fragments thereof, is inserted at a cloning site immediately downstream of a bacteriophage (T3, T7 or SP6) RNA polymerase promoter to produce antisense RNA. Preferably, the PG1 insert comprises at least 100 or more consecutive nucleotides of the genomic DNA sequence of SEQ ID NO: 1 or the cDNA sequences of SEQ ID NO: 3. The plasmid is linearized and transcribed in the presence of ribonucleotides comprising modified ribonucleotides (i.e. biotin-UTP and DIG-UTP). An excess of this doubly labeled RNA is hybridized in solution with mRNA isolated from cells or tissues of interest. The hybridizations are performed under standard stringent conditions (40-50° C. for 16 hours in an 80% formamide, 0.4 M NaCl buffer, pH 7-8). The unhybridized probe is removed by digestion with ribonucleases specific for single-stranded RNA (i.e. RNases CL3, T1, Phy M, U2 or A). The presence of the biotin-UTP modification enables capture of the hybrid on a microtitration plate coated with streptavidin. The presence of the DIG modification enables the hybrid to be detected and quantified by ELISA using an anti-DIG antibody coupled to alkaline phosphatase.

[0495] Quantitative analysis of PG1 gene expression may also be performed using arrays as described in Sections II and X. As used here, the term array means an arrangement of a plurality of nucleic acids of sufficient length to permit specific detection of expression of PG1 mRNAs capable of hybridizing thereto. For example, the arrays may contain a plurality of nucleic acids derived from genes whose expression levels are to be assessed. The arrays may include the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3 or the sequences complementary thereto or fragments thereof. The array may contain some or all of the known alternative splice or transcription species of PG1, including the species in SEQ ID NOs: 3 and 112-124, to determine the relative frequency of particular transcription species. Alternatively, the array may contain polynucleotides which overlap all of the potential splice junctions, including, for example, SEQ ID NOs: 137-178, so that the frequency of particular splice junctions can be determined and correlated with traits or used in diagnostics just as expression levels are. Preferably, the fragments are at least 15 nucleotides in length. In other embodiments, the fragments are at least 25 nucleotides in length. In some embodiments, the fragments are at least 50 nucleotides in length. More preferably, the fragments are at least 100 nucleotides in length. In another preferred embodiment, the fragments are more than 100 nucleotides in length. In some embodiments the fragments may be more than 500 nucleotides in length.

[0496] For example, quantitative analysis of PG1 gene expression may be performed with a complementary DNA microarray as described by Schena et al. (Science 270:467-470, 1995; Proc. Natl. Acad. Sci. U.S.A. 93:10614-10619, 1996). Full length PG1 cDNAs or fragments thereof are amplified by PCR and arrayed from a 96-well microtiter plate onto silylated microscope slides using high-speed robotics. Printed arrays are incubated in a humid chamber to allow rehydration of the array elements and rinsed, once in 0.2% SDS for 1 min, twice in water for 1 min and once for 5 min in sodium borohydride solution. The arrays are submerged in water for 2 min at 95° C., transferred into 0.2% SDS for 1 min, rinsed twice with water, air dried and stored in the dark at 25° C.

[0497] Cell or tissue mRNA is isolated or commercially obtained and probes are prepared by a single round of reverse transcription. Probes are hybridized to 1 cm² microarrays under a 14×14 mm glass coverslip for 6-12 hours at 60° C. Arrays are washed for 5 min at 25° C. in low stringency wash buffer (1×SSC/0.2% SDS), then for 10 min at room temperature in high stringency wash buffer (0.1×SSC/0.2% SDS). Arrays are scanned in 0.1×SSC using a fluorescence laser scanning device fitted with a custom filter set. Accurate differential expression measurements are obtained by taking the average of the ratios of two independent hybridizations.
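The final step above, averaging the ratios of two independent hybridizations, amounts to the following arithmetic, shown here as a Python sketch. The function names and the representation of each hybridization as a (test signal, reference signal) pair are assumptions made for illustration; the text does not specify how signals are normalized before the ratio is taken.

```python
def expression_ratio(test_signal, reference_signal):
    """Ratio of test to reference fluorescence for one array element."""
    if reference_signal <= 0:
        raise ValueError("reference signal must be positive")
    return test_signal / reference_signal

def mean_differential_expression(hybridization_1, hybridization_2):
    """Average the ratios from two independent hybridizations.

    Each argument is a (test_signal, reference_signal) pair for the same
    array element (hypothetical data layout).
    """
    ratio_1 = expression_ratio(*hybridization_1)
    ratio_2 = expression_ratio(*hybridization_2)
    return (ratio_1 + ratio_2) / 2.0
```

Note that the ratios are averaged, rather than the raw signals, so that each hybridization contributes equally regardless of its overall intensity.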

[0498] Quantitative analysis of PG1 gene expression may also be performed with full length PG1 cDNAs or fragments thereof in complementary DNA arrays as described by Pietu et al. (Genome Research 6:492-503, 1996). The full length PG1 cDNA or fragments thereof are PCR amplified and spotted on membranes. Then, mRNAs originating from various tissues or cells are labeled with radioactive nucleotides. After hybridization and washing in controlled conditions, the hybridized mRNAs are detected by phospho-imaging or autoradiography. Duplicate experiments are performed and a quantitative analysis of differentially expressed mRNAs is then performed.

[0499] Alternatively, expression analysis using the PG1 genomic DNA, the PG1 cDNA, or fragments thereof can be done through high density nucleotide arrays as described by Lockhart et al. (Nature Biotechnology 14:1675-1680, 1996) and Sosnowski et al. (Proc. Natl. Acad. Sci. 94:1119-1123, 1997). Oligonucleotides of 15-50 nucleotides from the sequences of the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NOs: 3 and 112-124, or the sequences complementary thereto, are synthesized directly on the chip (Lockhart et al., supra) or synthesized and then addressed to the chip (Sosnowski et al., supra).

[0500] PG1 cDNA probes labeled with an appropriate compound, such as biotin, digoxigenin or fluorescent dye, are synthesized from the appropriate mRNA population and then randomly fragmented to an average size of 50 to 100 nucleotides. The probes are then hybridized to the chip. After washing as described in Lockhart et al., supra, and application of different electric fields (Sosnowski et al., Proc. Natl. Acad. Sci. 94:1119-1123), the dyes or labeling compounds are detected and quantified. Duplicate hybridizations are performed. Comparative analysis of the intensity of the signal originating from cDNA probes on the same target oligonucleotide in different cDNA samples indicates a differential expression of PG1 mRNA.

[0501] The above methods may also be used to determine whether an individual exhibits a PG1 expression pattern associated with cancer, prostate cancer or other diseases. In such methods, nucleic acid samples from the individual are assayed for PG1 expression as described above. If a PG1 expression pattern associated with cancer, prostate cancer, or another disease is observed, an appropriate diagnosis is rendered and appropriate therapeutic techniques which target the PG1 gene or protein may be applied.

[0502] The above methods may also be applied using allele specific probes to determine whether an individual possesses a PG1 allele associated with cancer, prostate cancer, or another disease. In such approaches, one or more allele specific oligonucleotides containing polymorphic nucleotides in the PG1 gene which are associated with prostate cancer are fixed to a microarray. The array is contacted with a nucleic acid sample from the individual being tested under conditions which permit allele specific hybridization of the sample nucleic acid to the allele specific PG1 probes. Hybridization of the sample nucleic acid to one or more of the allele specific PG1 probes indicates that the individual suffers from prostate cancer caused by the PG1 gene or that the individual is at risk for developing prostate cancer at a subsequent time. Alternatively, any of the genotyping methods described in Section X may be utilized.

Use of the Biallelic Markers Of The Invention In Diagnostics

[0503] The biallelic markers of the present invention can also be used to develop diagnostic tests capable of identifying individuals who express a detectable trait as the result of a specific genotype or individuals whose genotype places them at risk of developing a detectable trait at a subsequent time.

[0504] The diagnostic techniques of the present invention may employ a variety of methodologies to determine whether a test subject has a biallelic marker pattern associated with an increased risk of developing a detectable trait or whether the individual suffers from a detectable trait as a result of a particular mutation, including methods which enable the analysis of individual chromosomes for haplotyping, such as family studies, single sperm DNA analysis or somatic hybrids. The trait analyzed using the present diagnostics may be any detectable trait, including cancer, prostate cancer or another disease, a response to an anti-cancer or anti-prostate cancer agent, or side effects to an anti-cancer or anti-prostate cancer agent. Diagnostics which analyze and predict response to a drug or side effects to a drug may be used to determine whether an individual should be treated with a particular drug. For example, if the diagnostic indicates a likelihood that an individual will respond positively to treatment with a particular drug, the drug may be administered to the individual. Conversely, if the diagnostic indicates that an individual is likely to respond negatively to treatment with a particular drug, an alternative course of treatment may be prescribed. A negative response is defined as either the absence of an efficacious response or the presence of toxic side effects.

[0505] Clinical drug trials represent another application for the markers of the present invention. One or more markers indicative of response to an anti-cancer or anti-prostate cancer agent or of side effects to an anti-cancer or anti-prostate cancer agent may be identified using the methods described in Section XI, below. Thereafter, potential participants in clinical trials of such an agent may be screened to identify those individuals most likely to respond favorably to the drug and exclude those likely to experience side effects. In that way, the effectiveness of drug treatment may be measured in individuals who respond positively to the drug, without lowering the measurement as a result of the inclusion in the study of individuals who are unlikely to respond positively, and without risking undesirable safety problems. Preferably, in such diagnostic methods, a nucleic acid sample is obtained from the individual and this sample is genotyped using methods described in Section X.

[0506] Another aspect of the present invention relates to a method of determining whether an individual is at risk of developing a trait or whether an individual expresses a trait as a consequence of possessing a particular trait-causing allele. The present invention also relates to a method of determining whether an individual is at risk of developing a plurality of traits or whether an individual expresses a plurality of traits as a result of possessing a particular trait-causing allele. These methods involve obtaining a nucleic acid sample from the individual and determining whether the nucleic acid sample contains one or more alleles of one or more biallelic markers indicative of a risk of developing the trait or indicative that the individual expresses the trait as a result of possessing a particular trait-causing allele.

[0507] As described herein, the diagnostics may be based on a single biallelic marker or a group of biallelic markers.

[0508] VI. ASSAYING THE PG1 PROTEIN FOR INVOLVEMENT IN RECEPTOR/LIGAND INTERACTIONS

[0509] The expressed PG1 protein or portion thereof is evaluated for involvement in receptor/ligand interactions as described in Example 16 below.

EXAMPLE 16

[0510] The proteins encoded by the PG1 gene or a portion thereof may also be evaluated for their involvement in receptor/ligand interactions. Numerous assays for such involvement are familiar to those skilled in the art, including the assays disclosed in the following references: Chapter 7.28 (Measurement of Cellular Adhesion under Static Conditions 7.28.1-7.28.22) in Current Protocols in Immunology, J. E. Coligan et al. Eds. Greene Publishing Associates and Wiley-Interscience; Takai et al., Proc. Natl. Acad. Sci. USA 84:6864-6868, 1987; Bierer et al., J. Exp. Med. 168:1145-1156, 1988; Rosenstein et al., J. Exp. Med. 169:149-160, 1989; Stoltenborg et al., J. Immunol. Methods 175:59-68, 1994; Stitt et al., Cell 80:661-670, 1995; Gyuris et al., Cell 75:791-803, 1993.

[0511] For example, the proteins of the present invention may demonstrate activity as receptors, receptor ligands or inhibitors or agonists of receptor/ligand interactions. Examples of such receptors and ligands include, without limitation, cytokine receptors and their ligands, receptor kinases and their ligands, receptor phosphatases and their ligands, receptors involved in cell-cell interactions and their ligands (including without limitation, cellular adhesion molecules (such as selectins, integrins and their ligands) and receptor/ligand pairs involved in antigen presentation, antigen recognition and development of cellular and humoral immune responses). Receptors and ligands are also useful for screening of potential peptide or small molecule inhibitors of the relevant receptor/ligand interaction. A protein of the present invention (including, without limitation, fragments of receptors and ligands) may itself be useful as an inhibitor of receptor/ligand interactions.

[0512] The PG1 protein or portions thereof described above may be used in drug screening procedures to identify molecules which are agonists, antagonists, or inhibitors of PG1 activity. The PG1 protein or portion thereof used in such analyses may be free in solution or linked to a solid support. Alternatively, PG1 protein or portions thereof can be expressed on a cell surface. The cell may naturally express the PG1 protein or portion thereof or, alternatively, the cell may express the PG1 protein or portion thereof from an expression vector such as those described below.

[0513] In one method of drug screening, eukaryotic or prokaryotic host cells which are stably transformed with recombinant polynucleotides in order to express the PG1 protein or a portion thereof are used in conventional competitive binding assays or standard direct binding assays. For example, the formation of a complex between the PG1 protein or a portion thereof and the agent being tested is measured in direct binding assays. Alternatively, the ability of a test agent to prevent formation of a complex between the PG1 protein or a portion thereof and a known ligand is measured.

[0514] Alternatively, the high throughput screening techniques disclosed in the published PCT application WO 84/03564 may be used. In such techniques, large numbers of small peptides to be tested for PG1 binding activity are synthesized on a surface and affixed thereto. The test peptides are contacted with the PG1 protein or a portion thereof, followed by a wash step. The amount of PG1 protein or portion thereof which binds to the test compound is quantitated using conventional techniques.

[0515] In some methods, PG1 protein or a portion thereof is fixed to a surface and contacted with a test compound. After a washing step, the amount of test compound which binds to the PG1 protein or portion thereof is measured.

[0516] In another approach, the three dimensional structure of the PG1 protein or a portion thereof may be determined and used for rational drug design.

[0517] Alternatively, the PG1 protein or a portion thereof is expressed in a host cell using expression vectors such as those described herein. The PG1 protein or portion thereof may be an isotype which is associated with prostate cancer or an isotype which is not associated with prostate cancer. The cells expressing the PG1 protein or portion thereof are contacted with a series of test agents and the effects of the test agents on PG1 activity are measured. Test agents which modify PG1 activity may be employed in therapeutic treatments.

[0518] The above procedures may also be applied to evaluate mutant PG1 proteins responsible for a detectable phenotype.

Identification of Proteins which Interact with the PG1 Protein

[0519] Proteins which interact with the PG1 protein may be identified as described in Example 17, below.

EXAMPLE 17

[0520] Proteins which interact with the PG1 protein or a portion thereof may be identified using two hybrid systems such as the Matchmaker Two Hybrid System 2 (Catalog No. K1604-1, Clontech). As described in the manual accompanying the Matchmaker Two Hybrid System 2 (Catalog No. K1604-1, Clontech), nucleic acids encoding the PG1 protein or a portion thereof are inserted into an expression vector such that they are in frame with DNA encoding the DNA binding domain of the yeast transcriptional activator GAL4. cDNAs in a cDNA library which encode proteins which might interact with the polypeptides encoded by the nucleic acids encoding the PG1 protein or a portion thereof are inserted into a second expression vector such that they are in frame with DNA encoding the activation domain of GAL4. The two expression plasmids are transformed into yeast and the yeast are plated on selection medium which selects for expression of selectable markers on each of the expression vectors as well as GAL4 dependent expression of the HIS3 gene. Transformants capable of growing on medium lacking histidine are screened for GAL4 dependent lacZ expression. Those cells which are positive in both the histidine selection and the lacZ assay contain plasmids encoding proteins which interact with the polypeptide encoded by the nucleic acid inserts.

[0521] Alternatively, the system described in Lustig et al., Methods in Enzymology 283: 83-99 (1997), may be used for identifying molecules which interact with the PG1 protein or a portion thereof. In such systems, in vitro transcription reactions are performed on vectors containing an insert encoding the PG1 protein or a portion thereof cloned downstream of a promoter which drives in vitro transcription. The resulting mRNA is introduced into Xenopus laevis oocytes. The oocytes are then assayed for a desired activity.

[0522] Alternatively, the in vitro transcription products produced as described above may be translated in vitro. The in vitro translation products can be assayed for a desired activity or for interaction with a known polypeptide.

[0523] The system described in U.S. Pat. No. 5,654,150 may also be usedto identify molecules which interact with the PG1 protein or a portionthereof. In this system, pools of cDNAs are transcribed and translatedin vitro and the reaction products are assayed for interaction with aknown polypeptide or antibody.

[0524] Proteins or other molecules interacting with the PG1 protein or portions thereof can be found by a variety of additional techniques. In one method, affinity columns containing the PG1 protein or a portion thereof can be constructed. In some versions of this method the affinity column contains chimeric proteins in which the PG1 protein or a portion thereof is fused to glutathione S-transferase. A mixture of cellular proteins or a pool of expressed proteins as described above is applied to the affinity column. Proteins interacting with the polypeptide attached to the column can then be isolated and analyzed on a 2-D electrophoresis gel as described in Ramunsen et al., Electrophoresis, 18, 588-598 (1997). Alternatively, the proteins retained on the affinity column can be purified by electrophoresis-based methods and sequenced. The same method can be used to isolate antibodies, to screen phage display products, or to screen phage display human antibodies.

[0525] Proteins interacting with the PG1 protein or portions thereof can also be screened by using an Optical Biosensor as described in Edwards and Leatherbarrow, Analytical Biochemistry, 246, 1-6 (1997). The main advantage of the method is that it allows the determination of the association rate between the protein and other interacting molecules. Thus, it is possible to specifically select interacting molecules with a high or low association rate. Typically a target molecule is linked to the sensor surface (through a carboxymethyl dextran matrix) and a sample of test molecules is placed in contact with the target molecules. The binding of a test molecule to the target molecule causes a change in the refractive index and/or thickness. This change is detected by the Biosensor provided it occurs in the evanescent field (which extends a few hundred nanometers from the sensor surface). In these screening assays, the target molecule can be the PG1 protein or a portion thereof and the test sample can be a collection of proteins extracted from tissues or cells, a pool of expressed proteins, combinatorial peptide and/or chemical libraries, or phage displayed peptides. The tissues or cells from which the test proteins are extracted can originate from any species.

[0526] In other methods, a target protein is immobilized and the test population is the PG1 protein or a portion thereof.

[0527] To study the interaction of the PG1 protein or a portion thereof with drugs, the microdialysis coupled to HPLC method described by Wang et al., Chromatographia, 44, 205-208 (1997) or the affinity capillary electrophoresis method described by Busch et al., J. Chromatogr. 777:311-328 (1997) can be used.

[0528] The above procedures may also be applied to evaluate mutant PG1 proteins responsible for a detectable phenotype.

[0529] VII. PRODUCTION OF ANTIBODIES AGAINST PG1 POLYPEPTIDES

[0530] Any PG1 polypeptide or whole protein (SEQ ID NOs: 4, 5, 70, 74, 125-136), whether human, mouse or mammalian, may be used to generate antibodies capable of specifically binding to expressed PG1 protein or fragments thereof as described in Example 16, below. The antibodies may be capable of binding the full length PG1 protein. PG1 proteins which result from naturally occurring mutants, particularly functional mutants of PG1, including SEQ ID NO: 70, may also be used in the production of antibodies. The present invention also contemplates the use of polypeptides comprising a contiguous stretch of at least 6 amino acids, preferably at least 8 to 10 amino acids, more preferably at least 12, 15, 20, 25, 50, or 100 amino acids of any PG1 protein in the manufacture of antibodies. In a preferred embodiment the contiguous stretch of amino acids comprises the site of a mutation or functional mutation, including a deletion, addition, swap or truncation of the amino acids in the PG1 protein sequence. For instance, polypeptides that contain either the Arg or His residue at amino acid position 184, and polypeptides that contain either the Arg or Ile residue at amino acid position 293 of SEQ ID NO: 4 in said contiguous stretch are particularly preferred embodiments of the invention and useful in the manufacture of antibodies to detect the presence and absence of these mutations. Similarly, polypeptides with a carboxy terminus at position 228 are particularly preferred embodiments of the invention and useful in the manufacture of antibodies to detect the presence and absence of the mutation shown in SEQ ID NOs: 69 and 70. Similarly, polypeptides that contain a peptide sequence of 8, 10, 12, 15, or 25 amino acids encoded over a naturally-occurring splice junction (the point at which two human PG1 exons (SEQ ID NOs: 100-111) are covalently linked) in said contiguous stretch are particularly preferred embodiments and useful in the manufacture of antibodies to detect the presence, localization, and quantity of the various protein products of the PG1 alternative splice species.

[0531] Alternatively, the antibodies may be screened so as to isolate those which are capable of binding an epitope-containing fragment of at least 8, 10, 12, 15, 20, 25, or 30 amino acids of a human, mouse, or mammalian PG1 protein, preferably a sequence selected from SEQ ID NOs: 4, 5, 70, 74, or 125-136.

[0532] Antibodies may also be generated which are capable of specifically binding to a given isoform of the PG1 protein. For example, the antibodies may be capable of specifically binding to an isoform of the PG1 protein which causes prostate cancer or another detectable phenotype which has been obtained as described above and expressed from an expression vector as described above. Alternatively, the antibodies may be capable of binding to an isoform of the PG1 protein which does not cause prostate cancer. Such antibodies may be used in diagnostic assays in which protein samples from an individual are evaluated for the presence of an isoform of the PG1 protein which causes cancer or another detectable phenotype using techniques such as Western blotting or ELISA assays.

[0533] Non-human animals or mammals, whether wild-type or transgenic, which express a different species of PG1 than the one to which antibody binding is desired, and animals which do not express PG1 (i.e., a PG1 knock-out animal as described in Section VIII) are particularly useful for preparing antibodies. PG1 knock-out animals will recognize all or most of the exposed regions of PG1 as foreign antigens, and therefore produce antibodies to a wider array of PG1 epitopes. The humoral immune system of animals which produce a species of PG1 that resembles the antigenic sequence will preferentially recognize the differences between the animal's native PG1 species and the antigen sequence, and produce antibodies to these unique sites in the antigen sequence.

Preferred Antibodies That Bind PG-1 Polypeptides of the Invention

[0534] Any PG-1 polypeptide or whole protein may be used to generate antibodies capable of specifically binding to an expressed PG-1 protein or fragments thereof as described.

[0535] One antibody composition of the invention is capable of specifically binding, or specifically binds, to the variant of the PG-1 protein of SEQ ID No 4. For an antibody composition to specifically bind to a first variant of PG-1, it must demonstrate at least a 5%, 10%, 15%, 20%, 25%, 50%, or 100% greater binding affinity for a full length first variant of the PG-1 protein than for a full length second variant of the PG-1 protein in an ELISA, RIA, or other antibody-based binding assay.
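The specific-binding criterion of the preceding paragraph can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a single assay readout per variant (for example a background-corrected ELISA signal) is taken as a proxy for binding affinity; the function names and this choice of readout are hypothetical and are not part of the disclosure.

```python
# Thresholds recited in the text, expressed as percent greater binding affinity.
THRESHOLDS_PCT = (5, 10, 15, 20, 25, 50, 100)


def percent_greater(first_variant: float, second_variant: float) -> float:
    """Percent by which binding to the first variant exceeds binding to the second.

    Assumption: both values are comparable readouts from the same assay.
    """
    if second_variant <= 0:
        raise ValueError("second-variant readout must be positive")
    return 100.0 * (first_variant - second_variant) / second_variant


def binds_specifically(first_variant: float, second_variant: float,
                       threshold_pct: float = 5.0) -> bool:
    """True if the first variant is bound at least threshold_pct more strongly."""
    return percent_greater(first_variant, second_variant) >= threshold_pct


def highest_threshold_met(first_variant: float, second_variant: float):
    """Largest recited threshold that is satisfied, or None if none is."""
    pct = percent_greater(first_variant, second_variant)
    met = [t for t in THRESHOLDS_PCT if pct >= t]
    return max(met) if met else None
```

For example, a readout of 1.5 for the first variant against 1.0 for the second corresponds to 50% greater binding, so the 5% through 50% thresholds are met while the 100% threshold is not.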

[0536] In a preferred embodiment, the invention concerns antibody compositions, either polyclonal or monoclonal, capable of selectively binding, or which selectively bind, to an epitope-containing polypeptide comprising a contiguous span of at least 6 amino acids, preferably at least 8 to 10 amino acids, more preferably at least 12, 15, 20, 25, 30, 40, 50, or 100 amino acids of SEQ ID No 4, wherein said epitope comprises at least 1, 2, 3, or 5 of the amino acid positions 1-26, 295-302, and 333-353.

[0537] The invention also concerns a purified or isolated antibody capable of specifically binding to a mutated PG-1 protein or to a fragment or variant thereof comprising an epitope of the mutated PG-1 protein. In another preferred embodiment, the present invention concerns an antibody capable of binding to a polypeptide comprising at least 10 consecutive amino acids of a PG-1 protein and including at least one of the amino acids which can be encoded by the trait causing mutations.

[0538] In a preferred embodiment, the invention concerns the use in the manufacture of antibodies of a polypeptide comprising a contiguous span of at least 6 amino acids, preferably at least 8 to 10 amino acids, more preferably at least 12, 15, 20, 25, 30, 40, 50, or 100 amino acids of SEQ ID No 4, wherein said contiguous span comprises at least 1, 2, 3, or 5 of the amino acid positions 1-26, 295-302, and 333-353.
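The span requirement of the preceding paragraph (a minimum length, plus a minimum number of positions falling within 1-26, 295-302, or 333-353 of SEQ ID No 4) can be checked mechanically. The following Python sketch is illustrative only; it assumes 1-based, inclusive amino acid numbering, and the function names are hypothetical.

```python
# Position ranges recited in the text (1-based, inclusive), relative to SEQ ID No 4.
REGIONS = ((1, 26), (295, 302), (333, 353))


def positions_covered(span_start: int, span_end: int) -> int:
    """Number of recited positions that fall inside the span [span_start, span_end]."""
    if span_start < 1 or span_end < span_start:
        raise ValueError("span must satisfy 1 <= span_start <= span_end")
    count = 0
    for lo, hi in REGIONS:
        overlap = min(span_end, hi) - max(span_start, lo) + 1
        if overlap > 0:
            count += overlap
    return count


def span_qualifies(span_start: int, span_end: int,
                   min_length: int = 6, min_positions: int = 1) -> bool:
    """True if the span is long enough and contains enough recited positions."""
    length = span_end - span_start + 1
    return (length >= min_length
            and positions_covered(span_start, span_end) >= min_positions)
```

Under these assumptions, a span covering positions 290-300 has length 11 and contains six of the recited positions (295-300), so it satisfies, for instance, the combination of at least 8 amino acids and at least 5 positions.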

EXAMPLE 18

[0539] Substantially pure protein or polypeptide is isolated from transfected or transformed cells containing an expression vector encoding the PG1 protein or a portion thereof as described in Example 11. The concentration of protein in the final preparation is adjusted, for example, by concentration on an Amicon filter device, to the level of a few micrograms/ml. Monoclonal or polyclonal antibody to the protein can then be prepared as follows:

A. Monoclonal Antibody Production by Hybridoma Fusion

[0540] Monoclonal antibody to epitopes in the PG1 protein or a portion thereof can be prepared from murine hybridomas according to the classical method of Kohler, G. and Milstein, C., Nature 256:495 (1975) or derivative methods thereof. Also see Harlow, E., and D. Lane. 1988. Antibodies: A Laboratory Manual. Cold Spring Harbor Laboratory. pp. 53-242.

[0541] Briefly, a mouse is repetitively inoculated with a few micrograms of the PG1 protein or a portion thereof over a period of a few weeks. The mouse is then sacrificed, and the antibody producing cells of the spleen isolated. The spleen cells are fused by means of polyethylene glycol with mouse myeloma cells, and the excess unfused cells destroyed by growth of the system on selective media comprising aminopterin (HAT media). The successfully fused cells are diluted and aliquots of the dilution placed in wells of a microtiter plate where growth of the culture is continued. Antibody-producing clones are identified by detection of antibody in the supernatant fluid of the wells by immunoassay procedures, such as ELISA, as originally described by Engvall, E., Meth. Enzymol. 70:419 (1980), and derivative methods thereof. Selected positive clones can be expanded and their monoclonal antibody product harvested for use. Detailed procedures for monoclonal antibody production are described in Davis, L. et al. Basic Methods in Molecular Biology, Elsevier, New York. Section 21-2.

B. Polyclonal Antibody Production by Immunization

[0542] Polyclonal antiserum containing antibodies to heterogeneous epitopes in the PG1 protein or a portion thereof can be prepared by immunizing a suitable non-human animal with the PG1 protein or a portion thereof, which can be unmodified or modified to enhance immunogenicity. A suitable non-human animal, preferably a non-human mammal, is selected, usually a mouse, rat, rabbit, goat, or horse. Alternatively, a crude preparation which has been enriched for PG1 concentration can be used to generate antibodies. Such proteins, fragments or preparations are introduced into the non-human mammal in the presence of an appropriate adjuvant (e.g. aluminum hydroxide, RIBI, etc.) which is known in the art. In addition, the protein, fragment or preparation can be pretreated with an agent which will increase antigenicity; such agents are known in the art and include, for example, methylated bovine serum albumin (mBSA), bovine serum albumin (BSA), Hepatitis B surface antigen, and keyhole limpet hemocyanin (KLH). Serum from the immunized animal is collected, treated and tested according to known procedures. If the serum contains polyclonal antibodies to undesired epitopes, the polyclonal antibodies can be purified by immunoaffinity chromatography.

[0543] Effective polyclonal antibody production is affected by many factors related both to the antigen and the host species. Also, host animals vary in response to site of inoculations and dose, with both inadequate and excessive doses of antigen resulting in low titer antisera. Small doses (ng level) of antigen administered at multiple intradermal sites appear to be most reliable. Techniques for producing and processing polyclonal antisera are known in the art; see, for example, Mayer and Walker (1987). An effective immunization protocol for rabbits can be found in Vaitukaitis, J. et al. J. Clin. Endocrinol. Metab. 33:988-991 (1971).

[0544] Booster injections can be given at regular intervals, and antiserum harvested when the antibody titer thereof, as determined semi-quantitatively, for example, by double immunodiffusion in agar against known concentrations of the antigen, begins to fall. See, for example, Ouchterlony, O. et al., Chap. 19 in: Handbook of Experimental Immunology, D. Weir (ed), Blackwell (1973). Plateau concentration of antibody is usually in the range of 0.1 to 0.2 mg/ml of serum (about 12 μM). Affinity of the antisera for the antigen is determined by preparing competitive binding curves, as described, for example, by Fisher, D., Chap. 42 in: Manual of Clinical Immunology, 2d Ed. (Rose and Friedman, Eds.) Amer. Soc. For Microbiol., Washington, D.C. (1980).

[0545] Antibody preparations prepared according to either the monoclonal or the polyclonal protocol are useful in quantitative immunoassays which determine concentrations of antigen-bearing substances in biological samples; they are also used semi-quantitatively or qualitatively to identify the presence of antigen in a biological sample. The antibodies may also be used in therapeutic compositions for killing cells expressing the protein or reducing the levels of the protein in the body.

[0546] VIII. VECTORS AND THE USES OF POLYNUCLEOTIDES IN CELLS, ANIMALS, AND HUMANS

[0547] The nucleic acids of the invention include expression vectors, amplification vectors, PCR-suitable polynucleotide primers, and vectors which are suitable for the introduction of a polynucleotide of the invention into embryonic stem cells for the production of transgenic non-human animals. In addition, vectors which are suitable for the introduction of a polynucleotide of the invention into cells, organs and individuals, including human individuals, for the purposes of gene therapy to reduce the severity of or prevent genetic diseases associated with functional mutations in PG1 genes are encompassed by the present invention. Functional mutations in PG1 genes which are suitable as targets for the gene therapy and transgenic vectors and methods of the invention include, but are not limited to, mutations in the coding region of the PG1 gene which affect the amino acid sequence of the PG1 gene's product, mutations in the promoter or other regulatory regions which affect the levels of PG1 expression, mutations in the PG1 splice sites which affect the length of the PG1 gene product or the relative frequency of PG1 alternative splicing species, and any other mutation which in any way affects the level or quality of PG1 expression or activity. The gene therapy methods can be achieved by targeting vectors and methods for changing a mutant PG1 gene into a wild-type PG1 gene in an embryonic stem cell or somatic cell. Alternatively, the present invention also encompasses methods and vectors for introducing the expression of wild-type PG1 sequences without the disruption of any mutant PG1 which already resides in the cell, organ or individual.

[0548] The invention also embodies amplification vectors, which comprise a polynucleotide of the invention and an origin of replication. Preferably, such amplification vectors further comprise restriction endonuclease sites flanking the polynucleotide, so as to facilitate cleavage and purification of the polynucleotides from the remainder of the amplification vector, and a selectable marker, so as to facilitate amplification of the amplification vector. Most preferably, the restriction endonuclease sites in the amplification vector are situated such that cleavage at those sites would result in no other amplification vector fragments of a similar size.

[0549] Thus, such an amplification vector may be transfected into a host cell compatible with the origin of replication of said amplification vector, wherein the host cell is a prokaryotic or eukaryotic cell, preferably a mammalian, insect, yeast, or bacterial cell, most preferably an Escherichia coli cell. The resulting transfected host cells may be grown by culture methods known in the art, preferably under selection compatible with the selectable marker (e.g., antibiotics). The amplification vectors can be isolated and purified by methods known in the art (e.g., standard plasmid prep procedures). The polynucleotide of the invention can be cleaved with restriction enzymes that specifically cleave at the restriction endonuclease sites flanking the polynucleotide, and the double-stranded polynucleotide fragment purified by techniques known in the art, including gel electrophoresis.

[0550] Alternatively, linear polynucleotides comprising a polynucleotide of the invention may be amplified by PCR. The PCR method is well known in the art and described in, e.g., U.S. Pat. Nos. 4,683,195 and 4,683,202 and Saiki, R. et al. 1988. Science 239:487-491, and European patent applications 86302298.4, 86302299.2 and 87300203.4, as well as Methods in Enzymology 1987 155:335-350.

[0551] The polynucleotides of the invention can also be derivatized in various ways, including those appropriate for facilitating transfection and/or gene therapy. The polynucleotides can be derivatized by attaching a nuclear localization signal to them to improve targeted delivery to the nucleus. One well-characterized nuclear localization signal is the heptapeptide PKKKRKV (pro-lys-lys-lys-arg-lys-val). Preferably, in the case of polynucleotides in the form of a closed circle, the nuclear localization signal is attached via a modified loop nucleotide or spacer that forms a branching structure.

[0552] If it is to be used in vivo, the polynucleotide of the invention can be derivatized to include ligands and/or delivery vehicles which provide dispersion through the blood, targeting to specific cell types, or easier transit of cellular barriers. Thus, the polynucleotides of the invention may be linked or combined with any targeting or delivery agent known in the art, including but not limited to, cell penetration enhancers, lipofectin, liposomes, dendrimers, DNA intercalators, and nanoparticles. In particular, nanoparticles for use in the delivery of the polynucleotides of the invention are particles of less than about 50 nanometers diameter, nontoxic, non-antigenic, and comprised of albumin and surfactant, or iron as in the nanoparticle technology of SynGenix. In general, the delivery vehicles used to target the polynucleotides of the invention may further comprise any cell specific or general targeting agents known in the art, and will have a specific trapping efficiency to the target cells or organs of from about 5 to about 35%.

[0553] The polynucleotides of the invention may be used ex vivo in a gene therapy method for obtaining cells or organs which produce wild-type PG1 or PG1 proteins which have been selectively mutated. The cells are created by incubation of the target cell with one or more of the above-described polynucleotides under standard conditions for uptake of nucleic acids, including electroporation or lipofection. In practicing an ex vivo method of treating cells or organs, the concentration of polynucleotides of the invention in a solution prepared to treat target cells or organs is from about 0.1 to about 100 μM, preferably 0.5 to 50 μM, most preferably from 1 to 10 μM.

[0554] Alternatively, the oligonucleotides can be modified or co-administered for targeted delivery to the nucleus. Improved oligonucleotide stability is expected in the nucleus due to: (1) lower levels of DNases and RNases; and (2) higher oligonucleotide concentrations due to lower total volume.

[0555] Alternatively, the polynucleotides of the invention can be covalently bonded to biotin to form a biotin-polynucleotide prodrug by methods known in the art, and co-administered with a receptor ligand bound to avidin or receptor specific antibody bound to avidin, wherein the receptor is capable of causing uptake of the resulting polynucleotide-biotin-avidin complex into the cells. Receptors that cause uptake are known to those of skill in the art.

[0556] The invention encompasses vectors which are suitable for the introduction of a polynucleotide of the invention into an embryonic stem cell for the production of transgenic non-human animals, which in turn result in the expression of recombinant PG1 in the transgenic animal. Any appropriate vector system can be used for the introduction and expression of PG1 in transgenic animals, including for example yeast artificial chromosomes (YAC), bacterial artificial chromosomes (BAC), bacteriophage P1, and other vectors known in the art which are able to accommodate sufficiently large inserts to encode the PG1 protein or desired fragments thereof. Selected alterations, additions and deletions in the PG1 gene may optionally be achieved by site-directed mutagenesis. Once an appropriate vector system is chosen, the site-directed mutagenesis process may then be conducted by techniques well known in the art, and the fragment returned and ligated to the larger vector from which it was cleaved. For site-directed mutagenesis methods see, for example, Kunkel, T. 1985. Proc. Natl. Acad. Sci. U.S.A. 82:488; Bandeyar, M. et al. 1988. Gene 65: 129-133; Nelson, M., and M. McClelland 1992. Methods Enzymol. 216:279-303; Weiner, M. 1994. Gene 151: 119-123; Costa, G. and M. Weiner. 1994. Nucleic Acids Res. 22: 2423; Hu, G. 1993. DNA and Cell Biology 12:763-770; and Deng, W. and J. Nickoff. 1992. Anal. Biochem. 200:81.

[0557] Briefly, the transgenic technology used herein involves the inactivation, addition or replacement of a portion of the PG1 gene or the entire gene. For example, the present technology includes the addition of PG1 genes with or without the inactivation of the non-human animal's native PG1 genes, as described in the preceding two paragraphs and in the Examples. The invention also encompasses the use of vectors, and the vectors themselves, which target and modify an existing human PG1 gene in a stem cell, whether it is contained in a non-human animal cell where it was previously introduced into the germ line by transgenic technology or it is a native PG1 gene in a human pluripotent or somatic cell. This transgene technology usually relies on homologous recombination in a pluripotent cell that is capable of differentiating into germ cell tissue. A DNA construct that encodes an altered region of the non-human animal's PG1 gene that contains, for instance, a stop codon to destroy expression, is introduced into the nuclei of embryonic stem cells. Preferably mice are used for this transgenic work. In a portion of the cells, the introduced DNA recombines with the endogenous copy of the cell's gene, replacing it with the altered copy. Cells containing the newly engineered genetic alteration are injected into a host embryo of the same species as the stem cell, and the embryo is reimplanted into a recipient female. Some of these embryos develop into chimeric individuals that possess germ cells entirely derived from the mutant cell line. Therefore, by breeding the chimeric progeny it is possible to obtain a new strain containing the introduced genetic alteration. See Capecchi 1989. Science. 244:1288-1292 for a review of this procedure.

[0558] The present invention encompasses the polynucleotides described herein, as well as the methods for making these polynucleotides including the method for creating a mutation in a human PG1 gene. In addition, the present invention encompasses cells which comprise the polynucleotides of the invention, including but not limited to amplification host cells comprising amplification vectors of the invention. Furthermore, the present invention comprises the embryonic stem cells and transgenic non-human animals and mammals described herein which comprise a gene encoding a human PG1 protein.

DNA construct that enables directing temporal and spatial gene expression in recombinant host cells and in transgenic animals.

[0559] In order to study the physiological and phenotypic consequences of a lack of synthesis of the PG1 protein, both at the cellular level and at the multi-cellular organism level, in particular with regard to disorders related to abnormal cell proliferation, notably cancers, the invention also encompasses DNA constructs and recombinant vectors enabling a conditional expression of a specific allele of the PG1 genomic sequence or cDNA and also of a copy of this genomic sequence or cDNA harboring substitutions, deletions, or additions of one or more bases with regard to the PG1 nucleotide sequence of SEQ ID NOs: 3, 112-125, 179, 182-184, or a fragment thereof, these base substitutions, deletions or additions being located either in an exon, an intron or a regulatory sequence, but preferably in a 5′-regulatory sequence of a mammalian PG1 gene, more preferably SEQ ID NO: 180, or in an exon of the PG1 genomic sequence or within the PG1 cDNA of SEQ ID NOs: 3, 112-125, or 184.

[0560] A first preferred DNA construct is based on the tetracycline resistance operon tet from E. coli transposon Tn10 for controlling the PG1 gene expression, such as described by Gossen M. et al., 1992, Proc. Natl. Acad. Sci. USA, 89: 5547-5551; Gossen M. et al., 1995, Science, 268: 1766-1769; and Furth P. A. et al., 1994, Proc. Natl. Acad. Sci. USA, 91: 9302-9306. Such a DNA construct contains seven tet operator sequences from Tn10 (tetop) that are fused to either a minimal promoter or a 5′-regulatory sequence of the PG1 gene, said minimal promoter or said PG1 regulatory sequence being operably linked to a polynucleotide of interest that codes either for a sense or an antisense oligonucleotide or for a polypeptide, including a PG1 polypeptide or a peptide fragment thereof. This DNA construct is functional as a conditional expression system for the nucleotide sequence of interest when the same cell also comprises a nucleotide sequence coding for either the wild type (tTA) or the mutant (rTA) repressor fused to the activating domain of viral protein VP16 of herpes simplex virus, placed under the control of a promoter, such as the HCMV IE1 enhancer/promoter or the MMTV-LTR. Indeed, a preferred DNA construct of the invention will comprise both the polynucleotide containing the tet operator sequences and the polynucleotide containing a sequence coding for the tTA or the rTA repressor.

[0561] In the specific embodiment wherein the conditional expression DNA construct contains the sequence encoding the mutant tetracycline repressor rTA, the expression of the polynucleotide of interest is silent in the absence of tetracycline and induced in its presence.
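The conditional behavior of the tet-based construct can be summarized as a small truth table. The Python sketch below is illustrative only. The rTA behavior (silent without tetracycline, induced with it) is taken from the preceding paragraph; the tTA behavior (expressed without tetracycline, silenced with it) is not stated in this section and is included as an assumption based on the general behavior of the wild-type repressor fusion in the cited tetracycline-responsive systems. The function name is hypothetical.

```python
def polynucleotide_expressed(regulator: str, tetracycline_present: bool) -> bool:
    """Predict whether the tetop-linked polynucleotide of interest is expressed.

    regulator: "rTA" (mutant repressor fusion) or "tTA" (wild-type repressor fusion).
    """
    if regulator == "rTA":
        # Stated in the text: silent without tetracycline, induced in its presence.
        return tetracycline_present
    if regulator == "tTA":
        # Assumption (not stated in this section): the wild-type fusion behaves
        # in the opposite manner and is switched off by tetracycline.
        return not tetracycline_present
    raise ValueError(f"unknown regulator: {regulator!r}")


if __name__ == "__main__":
    for reg in ("tTA", "rTA"):
        for tet in (False, True):
            state = "expressed" if polynucleotide_expressed(reg, tet) else "silent"
            print(f"{reg}, tetracycline {'present' if tet else 'absent'}: {state}")
```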

DNA constructs allowing homologous recombination: replacement vectors

[0562] A second preferred DNA construct will comprise, from 5′-end to 3′-end: (a) a first nucleotide sequence that is comprised of a PG1 sequence, preferably a PG1 genomic sequence; (b) a nucleotide sequence comprising a positive selection marker, such as the marker for neomycin resistance (neo); and (c) a second nucleotide sequence that is comprised of a PG1 sequence, preferably a PG1 genomic sequence, and is located on the genome downstream of the first PG1 nucleotide sequence (a).

[0563] In a preferred embodiment, this DNA construct also comprises a negative selection marker located upstream of the nucleotide sequence (a) or downstream of the nucleotide sequence (c). Preferably, the negative selection marker consists of the thymidine kinase (tk) gene (Thomas K. R. et al., 1986, Cell, 44: 419-428), the hygromycin beta gene (Te Riele et al., 1990, Nature, 348: 649-651), the hprt gene (Van der Lugt et al., 1991, Gene, 105: 263-267; and Reid L. H. et al., 1990, Proc. Natl. Acad. Sci. USA, 87: 4299-4303) or the Diphtheria toxin A fragment (Dt-A) gene (Nada S. et al., 1993, Cell, 73: 1125-1135; Yagi T. et al., 1990, Proc. Natl. Acad. Sci. USA, 87: 9918-9922). Preferably, the positive selection marker is located within a PG1 exon sequence so as to interrupt the sequence encoding a PG1 protein.

[0564] These replacement vectors are described for example by Thomas K. R. et al., 1986, Cell, 44: 419-428; Thomas K. R. et al., 1987, Cell, 51: 503-512; Mansour S. L. et al., 1988, Nature, 336: 348-352; and Koller et al., 1992, Annu. Rev. Immunol., 10: 705-30.

[0565] The first and second nucleotide sequences (a) and (c) may be located at any point within a PG1 regulatory sequence, an intronic sequence, an exon sequence or a sequence containing both regulatory and/or intronic and/or exon sequences. The length of nucleotide sequences (a) and (c) is determined empirically by one of ordinary skill in the art. Nucleotide sequences (a) and (c) of any length are specifically contemplated in the present invention; however, lengths ranging from 1 kb to 50 kb, preferably from 1 kb to 10 kb, more preferably from 2 kb to 6 kb and most preferably from 2 kb to 4 kb are normally used.

DNA constructs allowing homologous recombination: Cre-loxP system.

[0566] These new DNA constructs make use of the site-specific recombination system of the P1 phage. The P1 phage possesses a recombinase called Cre which interacts specifically with a 34 base pair loxP site. The loxP site is composed of two palindromic sequences of 13 bp separated by an 8 bp conserved sequence (Hoess et al., 1986, Nucleic Acids Res., 14: 2287-2300). The recombination by the Cre enzyme between two loxP sites having an identical orientation leads to the deletion of the DNA fragment located between them.
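The 13 + 8 + 13 bp organization of the loxP site and the outcome of Cre-mediated recombination between two sites in the same orientation can be illustrated with a short Python sketch. The loxP sequence used is the commonly cited wild-type site and is not recited in this section; the function names are hypothetical, and the model is a deliberately simplified string operation (it scans one strand only and ignores the circular excised product).

```python
# Commonly cited wild-type loxP site (assumption: not recited in this section).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def loxp_parts(site: str = LOXP):
    """Split a 34 bp site into left 13 bp arm, 8 bp spacer, right 13 bp arm."""
    return site[:13], site[13:21], site[21:]


def cre_excise(seq: str, site: str = LOXP) -> str:
    """Model Cre recombination between the first two same-orientation sites.

    The fragment between the sites is deleted together with one copy of the
    site, so that a single site remains at the locus. A sequence with fewer
    than two sites on the scanned strand is returned unchanged.
    """
    first = seq.find(site)
    if first == -1:
        return seq
    second = seq.find(site, first + len(site))
    if second == -1:
        return seq
    return seq[:first + len(site)] + seq[second + len(site):]
```

In this sketch the right 13 bp arm is the reverse complement of the left arm, which is the sense in which the two 13 bp elements mirror each other around the 8 bp spacer. Applying `cre_excise` to a marker flanked by two sites, such as `"GGG" + LOXP + "AAAA" + LOXP + "CCC"`, removes the marker and leaves one site behind.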

[0567] The Cre-loxP system used in combination with a homologous recombination technique was first described by Gu H. et al., 1993, Cell, 73: 1155-1164; and Gu H. et al., 1994, Science, 265: 103-106. Briefly, a nucleotide sequence of interest to be inserted in a targeted location of the genome harbors at least two loxP sites in the same orientation and located at the respective ends of a nucleotide sequence to be excised from the recombinant genome. The excision event requires the presence of the recombinase (Cre) enzyme within the nucleus of the recombinant host cell. The recombinase enzyme is brought at the desired time either by (a) incubating the recombinant host cells in a culture medium containing this enzyme, by injecting the Cre enzyme directly into the desired cell, such as described by Araki K. et al., 1995, Proc. Natl. Acad. Sci. USA, 92: 160-164, or by lipofection of the enzyme into the cells, such as described by Baubonis et al., 1993, Nucleic Acids Res., 21: 2025-2029; (b) transfecting the cell host with a vector comprising the Cre coding sequence operably linked to a promoter functional in the recombinant cell host, which promoter is optionally inducible, said vector being introduced in the recombinant cell host, such as described by Gu H. et al., 1993, Cell, 73: 1155-1164; and Sauer B. et al., 1988, Proc. Natl. Acad. Sci. USA, 85: 5166-5170; or (c) introducing in the genome of the host cell a polynucleotide comprising the Cre coding sequence operably linked to a promoter functional in the recombinant cell host, which promoter is optionally inducible, said polynucleotide being inserted in the genome of the cell host either by a random insertion event or a homologous recombination event, such as described by Gu H. et al., 1994, Science, 265: 103-106.

[0568] In the specific embodiment wherein the vector containing the sequence to be inserted in the PG1 gene by homologous recombination is constructed in such a way that selectable markers are flanked by loxP sites of the same orientation, it is possible, by treatment with the Cre enzyme, to eliminate the selectable markers while leaving the PG1 sequences of interest that have been inserted by a homologous recombination event. Again, two selectable markers are needed: a positive selection marker to select for the recombination event and a negative selection marker to select for the homologous recombination event. Vectors and methods using the Cre-loxP system are described by Zou Y. R. et al., 1994, Curr. Biol., 4: 1099-1103.

[0569] Thus, a third preferred DNA construct of the invention comprises, from 5′-end to 3′-end: (a) a first nucleotide sequence that is comprised of a PG1 sequence, preferably a PG1 genomic sequence; (b) a nucleotide sequence comprising a polynucleotide encoding a positive selection marker, such as the marker for neomycin resistance (neo), said nucleotide sequence additionally comprising two sequences defining a site recognized by a recombinase, such as a loxP site, the two sites being placed in the same orientation; and (c) a second nucleotide sequence that is comprised of a PG1 sequence, preferably a PG1 genomic sequence, and is located on the genome downstream of the first PG1 nucleotide sequence (a).

[0570] The sequences defining a site recognized by a recombinase, such as a loxP site, are preferably located within the nucleotide sequence (b) at suitable locations bordering the nucleotide sequence for which the conditional excision is sought. In one specific embodiment, two loxP sites flank the positive selection marker sequence, one on each side, in order to allow its excision at a desired time after the occurrence of the homologous recombination event.
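The excision logic of such a construct can be illustrated with a minimal symbolic sketch. The element names (`PG1_5prime_arm`, `neo`, `PG1_3prime_arm`) and the function `cre_excise` are hypothetical labels introduced only for illustration; the model treats the construct as an ordered list of elements and assumes that recombination between two loxP sites in the same orientation removes the intervening segment and leaves a single loxP site behind.

```python
from typing import List

LOXP = "loxP"  # symbolic stand-in for a loxP site, not a real sequence


def cre_excise(construct: List[str]) -> List[str]:
    """Model Cre-mediated excision between the first two loxP sites.

    Assumes both sites are in the same orientation. The segment between
    them is removed and one loxP site remains in the product.
    """
    sites = [i for i, element in enumerate(construct) if element == LOXP]
    if len(sites) < 2:
        # Fewer than two sites: nothing can be excised.
        return list(construct)
    first, second = sites[0], sites[1]
    # Keep everything up to and including the first site, then resume
    # after the second site.
    return construct[: first + 1] + construct[second + 1:]


# Hypothetical layout of the third DNA construct after targeting.
targeted_allele = ["PG1_5prime_arm", LOXP, "neo", LOXP, "PG1_3prime_arm"]
print(cre_excise(targeted_allele))
```

In this sketch the positive selection marker disappears from the allele while both PG1 segments are retained, which is the outcome the embodiment above is designed to achieve.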

[0571] In a preferred embodiment of a method using the third DNA construct described above, the excision of the polynucleotide fragment bordered by the two sites recognized by a recombinase, preferably two loxP sites, is performed at a desired time, due to the presence within the genome of the recombinant host cell of a sequence encoding the Cre enzyme operably linked to a promoter sequence, preferably an inducible promoter, more preferably a tissue-specific promoter sequence and most preferably a promoter sequence which is both inducible and tissue-specific, such as described by Gu H. et al., 1994, Science, 265: 103-106.

[0572] The presence of the Cre enzyme within the genome of the recombinant cell host may result from the breeding of two transgenic animals, the first transgenic animal bearing the PG1-derived sequence of interest containing the loxP sites as described above and the second transgenic animal bearing the Cre coding sequence operably linked to a suitable promoter sequence, such as described by Gu H. et al., 1994, Science, 265: 103-106. Spatio-temporal control of Cre enzyme expression may also be achieved with an adenovirus-based vector that contains the Cre gene, thus allowing infection of cells, or in vivo infection of organs, for delivery of the Cre enzyme, such as described by Anton M. et al., 1995, J. Virol., 69: 4600-4606; and Kanegae Y. et al., 1995, Nucl. Acids Res., 23: 3816-3821.

[0573] The DNA constructs described above may be used to introduce a desired nucleotide sequence of the invention, preferably a PG1 genomic sequence or a PG1 cDNA sequence, and most preferably an altered copy of a PG1 genomic or cDNA sequence, within a predetermined location of the targeted genome, leading either to the generation of an altered copy of a targeted gene (knock-out homologous recombination) or to the replacement of a copy of the targeted gene by another copy sufficiently homologous to allow a homologous recombination event to occur (knock-in homologous recombination).

Nuclear antisense DNA constructs

[0574] Preferably, the antisense polynucleotides of the invention have a 3′ polyadenylation signal that has been replaced with a self-cleaving ribozyme sequence, such that RNA polymerase II transcripts are produced without poly(A) at their 3′ ends, these antisense polynucleotides being incapable of export from the nucleus, such as described by Liu Z. et al., 1994, Proc. Natl. Acad. Sci. USA, 91: 4258-4262. In a preferred embodiment, these PG1 antisense polynucleotides also comprise, within the ribozyme cassette, a histone stem-loop structure to stabilize cleaved transcripts against 3′-5′ exonucleolytic degradation, such as described by Eckner R. et al., 1991, EMBO J., 10: 3513-3522.

Expression Vectors

[0575] The polynucleotides of the invention also include expression vectors. Expression vector systems, control sequences and compatible hosts are known in the art. For a review of these systems see, for example, U.S. Pat. No. 5,350,671, columns 45-48. Any of the standard methods known to those skilled in the art for the insertion of DNA fragments into a vector may be used to construct expression vectors containing a chimeric gene consisting of appropriate transcriptional/translational control signals and the protein coding sequences. These methods may include in vitro recombinant DNA and synthetic techniques and in vivo recombination (genetic recombination).

[0576] Expression of a polypeptide, peptide or derivative, or analogs thereof encoded by a polynucleotide sequence in SEQ ID NOs: 3, 69, 100-112, or 179-184 may be regulated by a second nucleic acid sequence so that the protein or peptide is expressed in a host transformed with the recombinant DNA molecule. For example, expression of a protein or peptide may be controlled by any promoter/enhancer element known in the art. Promoters which may be used to control expression include, but are not limited to, the CMV promoter, the SV40 early promoter region (Bernoist and Chambon, 1981, Nature 290:304-310), the promoter contained in the 3′ long terminal repeat of Rous sarcoma virus (Yamamoto, et al., 1980, Cell 22:787-797), the herpes thymidine kinase promoter (Wagner et al., 1981, Proc. Natl. Acad. Sci. U.S.A. 78:1441-1445), the regulatory sequences of the metallothionein gene (Brinster et al., 1982, Nature 296:39-42); prokaryotic expression vectors such as the beta-lactamase promoter (Villa-Komaroff, et al., 1978, Proc. Natl. Acad. Sci. U.S.A. 75:3727-3731), or the tac promoter (DeBoer, et al., 1983, Proc. Natl. Acad. Sci. U.S.A. 80:21-25); see also “Useful proteins from recombinant bacteria” in Scientific American, 1980, 242:74-94; plant expression vectors comprising the nopaline synthetase promoter region (Herrera-Estrella et al., 1983, Nature 303:209-213) or the cauliflower mosaic virus 35S RNA promoter (Gardner, et al., 1981, Nucl. Acids Res. 9:2871), and the promoter of the photosynthetic enzyme ribulose bisphosphate carboxylase (Herrera-Estrella et al., 1984, Nature 310:115-120); promoter elements from yeast or other fungi such as the Gal 4 promoter, the ADC (alcohol dehydrogenase) promoter, PGK (phosphoglycerol kinase) promoter, alkaline phosphatase promoter; and the following animal transcriptional control regions, which exhibit tissue specificity and have been utilized in transgenic animals: elastase I gene control region which is active in pancreatic acinar cells (Swift et al., 1984, Cell 38:639-646; Ornitz et al., 1986, Cold Spring Harbor Symp. Quant. Biol. 50:399-409; MacDonald, 1987, Hepatology 7:425-515); insulin gene control region which is active in pancreatic beta cells (Hanahan, 1985, Nature 315:115-122); immunoglobulin gene control region which is active in lymphoid cells (Grosschedl et al., 1984, Cell 38:647-658; Adames et al., 1985, Nature 318:533-538; Alexander et al., 1987, Mol. Cell. Biol. 7:1436-1444); mouse mammary tumor virus control region which is active in testicular, breast, lymphoid and mast cells (Leder et al., 1986, Cell 45:485-495); albumin gene control region which is active in liver (Pinkert et al., 1987, Genes and Devel. 1:268-276); alpha-fetoprotein gene control region which is active in liver (Krumlauf et al., 1985, Mol. Cell. Biol. 5:1639-1648; Hammer et al., 1987, Science 235:53-58); alpha I-antitrypsin gene control region which is active in the liver (Kelsey et al., 1987, Genes and Devel. 1:161-171); beta-globin gene control region which is active in myeloid cells (Mogram et al., 1985, Nature 315:338-340; Kollias et al., 1986, Cell 46:89-94); myelin basic protein gene control region which is active in oligodendrocyte cells in the brain (Readhead et al., 1987, Cell 48:703-712); myosin light chain-2 gene control region which is active in skeletal muscle (Sani, 1985, Nature 314:283-286); and gonadotropic releasing hormone gene control region which is active in the hypothalamus (Mason et al., 1986, Science 234:1372-1378).

[0577] Other suitable vectors, particularly for the expression of genes in mammalian cells, may be selected from the group of vectors consisting of P1 bacteriophages and bacterial artificial chromosomes (BACs). These types of vectors may contain large inserts ranging from about 80-90 kb (P1 bacteriophage) to about 300 kb (BACs).

P1 bacteriophage

[0578] The construction of P1 bacteriophage vectors such as p158 or p158/neo8 is notably described by Sternberg N. L., 1992, Trends Genet., 8: 1-16; and Sternberg N. L., 1994, Mamm. Genome, 5: 397-404. Recombinant P1 clones comprising PG1 nucleotide sequences may be designed for inserting large polynucleotides of more than 40 kb (Linton M. F. et al., 1993, J. Clin. Invest., 92: 3029-3037). To generate P1 DNA for transgenic experiments, a preferred protocol is the protocol described by McCormick et al., 1994, Genet. Anal. Tech. Appl., 11: 158-164. Briefly, E. coli (preferably strain NS3529) harboring the P1 plasmid is grown overnight in a suitable broth medium containing 25 μg/ml of kanamycin. The P1 DNA is prepared from the E. coli by alkaline lysis using the Qiagen Plasmid Maxi kit (Qiagen, Chatsworth, Calif., USA), according to the manufacturer's instructions. The P1 DNA is purified from the bacterial lysate on two Qiagen-tip 500 columns, using the washing and elution buffers contained in the kit. A phenol/chloroform extraction is then performed before precipitating the DNA with 70% ethanol. After solubilizing the DNA in TE (10 mM Tris-HCl, pH 7.4, 1 mM EDTA), the concentration of the DNA is assessed by spectrophotometry.

[0579] When the goal is to express a P1 clone comprising PG1 nucleotide sequences in a transgenic animal, typically in transgenic mice, it is desirable to remove vector sequences from the P1 DNA fragment, for example by cleaving the P1 DNA at rare-cutting sites within the P1 polylinker (SfiI, NotI or SalI). The P1 insert is then purified from vector sequences on a pulsed-field agarose gel, using methods similar to those originally reported for the isolation of DNA from YACs (Schedl A. et al., 1993, Nature, 362: 258-261; and Peterson et al., 1993, Proc. Natl. Acad. Sci. USA, 90: 7593-7597). At this stage, the resulting purified insert DNA can be concentrated, if necessary, on a Millipore Ultrafree-MC Filter Unit (Millipore, Bedford, Mass., USA; 30,000 molecular weight limit) and then dialyzed against microinjection buffer (10 mM Tris-HCl, pH 7.4; 250 μM EDTA) containing 100 mM NaCl, 30 μM spermine, 70 μM spermidine on a microdialysis membrane (type VS, 0.025 μm, from Millipore). The intactness of the purified P1 DNA insert is assessed by electrophoresis on a 1% agarose (Sea Kem GTG; FMC Bio-products) pulsed-field gel and staining with ethidium bromide.

Bacterial Artificial Chromosomes (BACs)

[0580] The bacterial artificial chromosome (BAC) cloning system (Shizuya et al., 1992, Proc. Natl. Acad. Sci. USA, 89: 8794-8797) has been developed to stably maintain large fragments of genomic DNA (100-300 kb) in E. coli. A preferred BAC vector is the pBeloBAC11 vector described by Kim U. J., et al., 1996, Genomics, 34: 213-218. BAC libraries are prepared with this vector using size-selected genomic DNA that has been partially digested using enzymes that permit ligation into either the Bam HI or Hind III sites in the vector. Flanking these cloning sites are T7 and SP6 RNA polymerase transcription initiation sites that can be used to generate end probes by either RNA transcription or PCR methods. After the construction of a BAC library in E. coli, BAC DNA is purified from the host cell as a supercoiled circle. Converting these circular molecules into a linear form precedes both size determination and introduction of the BACs into recipient cells. The cloning site is flanked by two Not I sites, permitting cloned segments to be excised from the vector by Not I digestion. Alternatively, the DNA insert contained in the pBeloBAC11 vector may be linearized by treatment of the BAC vector with the commercially available enzyme lambda terminase, which cleaves at the unique cosN site; this cleavage method, however, results in a full-length BAC clone containing both the insert DNA and the BAC sequences.

Host Cells

[0581] PG1 gene expression in human cells may be rendered defective, or alternatively the PG1 gene counterpart in the genome of an animal cell may be replaced by inserting a PG1 genomic or cDNA sequence according to the invention. These genetic alterations may be generated by homologous recombination events using the specific DNA constructs that have been previously described.

[0582] One kind of host cell that may be used is the mammalian zygote, such as a murine zygote. For example, murine zygotes may undergo microinjection with a purified DNA molecule of interest, for example a purified DNA molecule that has previously been adjusted to a concentration range from 1 ng/ml (for BAC inserts) to 3 ng/μl (for P1 bacteriophage inserts) in 10 mM Tris-HCl, pH 7.4, 250 μM EDTA containing 100 mM NaCl, 30 μM spermine, and 70 μM spermidine. When the DNA to be microinjected has a large size, polyamines and high salt concentrations can be used in order to avoid mechanical breakage of this DNA, as described by Schedl et al., 1993, Nucleic Acids Res., 21: 4783-4787.

[0583] Any one of the polynucleotides of the invention, including the DNA constructs described herein, may be introduced in an embryonic stem (ES) cell line, preferably a mouse ES cell line. ES cell lines are derived from pluripotent, uncommitted cells of the inner cell mass of pre-implantation blastocysts. Preferred ES cell lines are the following: ES-E14TG2a (ATCC No. CRL-1821), ES-D3 (ATCC No. CRL-1934 and No. CRL-11632), YS001 (ATCC No. CRL-11776), 36.5 (ATCC No. CRL-11116). To maintain ES cells in an uncommitted state, they are cultured in the presence of growth-inhibited feeder cells which provide the appropriate signals to preserve this embryonic phenotype and serve as a matrix for ES cell adherence. Preferred feeder cells consist of primary embryonic fibroblasts that are established from tissue of day 13-day 14 embryos of virtually any mouse strain, that are maintained in culture, such as described by Abbondanzo S J et al., 1993, Methods in Enzymology, Academic Press, New York, pp. 803-823; and are inhibited in growth by irradiation, such as described by Robertson E., 1987, Embryo-derived stem cell lines. E. J. Robertson Ed. Teratocarcinomas and embryonic stem cells: a practical approach. IRL Press, Oxford, pp. 71, or by the presence of an inhibitory concentration of LIF, such as described by Pease S. and William R. S., 1990, Exp. Cell. Res., 190: 209-211.

Transgenic Animals

[0584] The terms “transgenic animals” or “host animals” are used herein to designate non-human animals that have their genome genetically and artificially manipulated so as to include one of the nucleic acids according to the invention. Preferred animals are non-human mammals and include those belonging to a genus selected from Mus (e.g. mice), Rattus (e.g. rats) and Oryctolagus (e.g. rabbits) which have their genome artificially and genetically altered by the insertion of a nucleic acid according to the invention.

[0585] The transgenic animals of the invention all include within a plurality of their cells a cloned recombinant or synthetic DNA sequence, more specifically one of the purified or isolated nucleic acids comprising a PG1 coding sequence, a PG1 regulatory polynucleotide or a DNA sequence encoding an antisense polynucleotide such as described in the present specification.

[0586] Preferred transgenic animals according to the invention contain in their somatic cells and/or in their germ line cells a polynucleotide selected from the following group of polynucleotides:

[0587] a) a non-native, purified or isolated nucleic acid encoding a PG1 polypeptide, or a polypeptide fragment or variant thereof.

[0588] b) a non-native, purified or isolated nucleic acid comprising at least 8 consecutive nucleotides of the nucleotide sequence SEQ ID NOs: 179, 182, or 183, or a nucleotide sequence complementary thereto; in some embodiments, the length of the fragments can range from at least 8, 10, 15, 20 or 30 to 200 nucleotides, preferably from at least 10 to 50 nucleotides, more preferably from at least 40 to 50 nucleotides of SEQ ID NOs: 179, 182, or 183, or the sequence complementary thereto. In some embodiments, the fragments may comprise more than 200 nucleotides of SEQ ID NOs: 179, 182, or 183, or the sequence complementary thereto.

[0589] c) a non-native, purified or isolated nucleic acid comprising at least 8 consecutive nucleotides of the nucleotide sequence SEQ ID NOs: 3, 69, 112-125 or 184, a sequence complementary thereto or a variant thereof; in some embodiments, the length of the fragments can range from at least 8, 10, 15, 20 or 30 to 200 nucleotides, preferably from at least 10 to 50 nucleotides, more preferably from at least 40 to 50 nucleotides of SEQ ID NOs: 3, 69, 112-125 or 184, or the sequence complementary thereto. In some embodiments, the fragments may comprise more than 200 nucleotides of SEQ ID NOs: 3, 69, 112-125 or 184, or the sequence complementary thereto.

[0590] d) a non-native, purified or isolated nucleic acid comprising a nucleotide sequence selected from the group of SEQ ID NOs: 100 to 111, a sequence complementary thereto or a fragment or a variant thereof.

[0591] e) a non-native, purified or isolated nucleic acid comprising a combination of at least two polynucleotides selected from the group consisting of SEQ ID NOs: 100 to 111, or the sequences complementary thereto, wherein the polynucleotides are arranged within the nucleic acid, from the 5′ end to the 3′ end of said nucleic acid, in the same order as in SEQ ID NOs: 179, 182, or 183.

[0592] f) a non-native, purified or isolated nucleic acid comprising the nucleotide sequence SEQ ID NO: 180, or the sequence complementary thereto or a biologically active fragment or variant of the nucleotide sequence of SEQ ID NO: 180, or the sequence complementary thereto.

[0593] g) a non-native, purified or isolated nucleic acid comprising the nucleotide sequence SEQ ID NO: 181, or the sequence complementary thereto or a biologically active fragment or variant of the nucleotide sequence of SEQ ID NO: 181 or the sequence complementary thereto.

[0594] h) a polynucleotide consisting of:

[0595] (1) a nucleic acid comprising a regulatory polynucleotide of SEQ ID NO: 180 or the sequence complementary thereto or a biologically active fragment or variant thereof.

[0596] (2) a polynucleotide encoding a desired polypeptide or nucleic acid.

[0597] (3) optionally, a nucleic acid comprising a regulatory polynucleotide of SEQ ID NO: 181, or the sequence complementary thereto or a biologically active fragment or variant thereof.

[0598] i) a DNA construct as described previously in the present specification.

[0599] The transgenic animals of the invention thus contain specific sequences of exogenous, or “non-native”, genetic material such as the nucleotide sequences described above in detail.

[0600] In a first preferred embodiment, these transgenic animals may be good experimental models in order to study the diverse pathologies related to cell differentiation, in particular the transgenic animals within the genome of which one or several copies of a polynucleotide encoding a native PG1 protein, or alternatively a mutant PG1 protein, have been inserted.

[0601] In a second preferred embodiment, these transgenic animals may express a desired polypeptide of interest under the control of the regulatory polynucleotides of the PG1 gene, leading to good yields in the synthesis of this protein of interest and, optionally, to tissue-specific expression of this protein of interest.

[0602] The design of the transgenic animals of the invention is made according to conventional techniques well known to those skilled in the art. For more details regarding the production of transgenic animals, and specifically transgenic mice, reference is made to Sandou et al. (1994) and also to U.S. Pat. Nos. 4,873,191, issued Oct. 10, 1989, 5,464,764, issued Nov. 7, 1995, and 5,789,215, issued Aug. 4, 1998, these documents being herein incorporated by reference to disclose methods of producing transgenic mice.

[0603] Transgenic animals of the present invention are produced by the application of procedures which result in an animal with a genome that has incorporated exogenous genetic material. The procedure involves obtaining the genetic material, or a portion thereof, which encodes either a PG1 coding sequence, a PG1 regulatory polynucleotide or a DNA sequence encoding a PG1 antisense polynucleotide such as described in the present specification.

[0604] A recombinant polynucleotide of the invention is inserted into an embryonic stem (ES) cell line. The insertion is preferably made using electroporation, such as described by Thomas K. R. et al., 1987, Cell, 51: 503-512. The cells subjected to electroporation are screened (e.g. by selection via selectable markers, by PCR or by Southern blot analysis) to find positive cells which have integrated the exogenous recombinant polynucleotide into their genome, preferably via a homologous recombination event. An illustrative positive-negative selection procedure that may be used according to the invention is described by Mansour S. L. et al., 1988, Nature, 336: 348-352.

[0605] Then, the positive cells are isolated, cloned and injected into 3.5-day-old blastocysts from mice, such as described by Bradley A., 1987, Production and analysis of chimaeric mice. In: E. J. Robertson (Ed.), Teratocarcinomas and embryonic stem cells: A practical approach. IRL Press, Oxford, pp. 113. The blastocysts are then inserted into a female host animal and allowed to grow to term.

[0606] Alternatively, the positive ES cells are brought into contact with 2.5-day-old embryos at the 8-16 cell stage (morulae), such as described by Wood S. A. et al., 1993, Proc. Natl. Acad. Sci. USA, 90: 4582-4585; or by Nagy A. et al., 1993, Proc. Natl. Acad. Sci. USA, 90: 8424-8428. The ES cells are internalized and extensively colonize the blastocyst, including the cells which will give rise to the germ line. The offspring of the female host are tested to determine which animals are transgenic, i.e. include the inserted exogenous DNA sequence, and which are wild-type.

[0607] Thus, the present invention also concerns a transgenic animal containing a nucleic acid, a recombinant expression vector or a recombinant host cell according to the invention.

Recombinant cell lines derived from the transgenic animals of the invention

[0608] A further object of the invention consists of recombinant host cells obtained from a transgenic animal described herein.

[0609] Recombinant cell lines may be established in vitro from cells obtained from any tissue of a transgenic animal according to the invention, for example by transfection of primary cell cultures with vectors expressing onc-genes such as SV40 large T antigen, as described by Chou J. Y., 1989, Mol. Endocrinol., 3: 1511-1514; and Shay J. W. et al., 1991, Biochem. Biophys. Acta, 1072: 1-7.

Functional Analysis of the PG1 Polypeptides In Transgenic Animals

[0610] Using different BACs that contain the PG1 gene, we performed FISH experiments on the prostatic adenocarcinoma cell line PC3. Only one signal could be detected, showing that this region of chromosome 8 is hemizygous in this tumor cell line.

[0611] To study the function of PG1, the remaining allele of PG1 in the PC3 cell line is inactivated by homologous recombination. To inactivate the remaining PG1 allele, a knock-out targeting vector is generated by inserting two genomic DNA fragments of 3.0 and 4.3 kb (that correspond to a sequence upstream of the PG1 promoter and to part of intron 1, respectively) in the pKO Scrambler Neo TK vector (Lexicon ref V1901). Since the targeting vector contains the neomycin resistance gene as well as the Tk gene, homologous recombination is selected by adding geneticin and FIAU to the medium. The promoter, the transcriptional start site, and the first ATG contained in exon 1 on the recombinant allele are deleted by homologous recombination between the targeting vector and the remaining PG1 allele. Accordingly, no coding transcripts are initiated from the recombinant allele. The parental PC3 cells as well as cells hemizygous for the null allele are assessed for their phenotype, their growth rate in liquid culture, their ability to grow in agar (anchorage-independent growth) as well as their ability to form tumors and metastases when injected subcutaneously in nude mice.
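The logic of the double selection with geneticin and FIAU can be summarized in a short sketch. It rests on assumptions not spelled out in the paragraph above: that geneticin kills cells lacking the neomycin resistance gene, that FIAU kills cells expressing the Tk gene, and that the Tk cassette lies outside the homology arms so that it is lost on homologous recombination but typically retained on random integration. The function and clone names are hypothetical.

```python
def survives_selection(has_neo: bool, has_tk: bool) -> bool:
    """Return True if a clone survives geneticin plus FIAU.

    Assumption: geneticin kills cells without neo (positive selection),
    and FIAU kills cells that express Tk (negative selection).
    """
    return has_neo and not has_tk


# Hypothetical clone classes as (has_neo, has_tk).
clones = {
    "no integration": (False, False),
    "random integration (neo and Tk retained)": (True, True),
    "homologous recombination (neo retained, Tk lost)": (True, False),
}

for name, (neo, tk) in clones.items():
    outcome = "survives" if survives_selection(neo, tk) else "dies"
    print(f"{name}: {outcome}")
```

Under these assumptions only the homologous recombinants survive both drugs. In practice surviving clones would still be confirmed by PCR or Southern blot analysis, since random integrants can occasionally lose or silence the Tk gene.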

[0612] To determine the function of PG1 in the animal, and to generate an animal model for prostate tumorigenesis, mice in which tissue-specific inactivation of the PG1 alleles can be induced are generated. For this purpose, the Cre-loxP system is utilized as described above to allow chromosome engineering to be performed directly in the animal.

[0613] First, to generate mice with a conditional null allele, two loxP sites are introduced in the murine genome, the first one 5′ to the PG1 promoter and the second one 3′ to the PG1 exon 1. Alternatively, to generate subtle mutations or to specifically mutate some isoforms, the loxP sites are introduced so that they flank any of the given exons or any potential set of exons. It is important to note that a functional PG1 messenger can be transcribed from these alleles until a recombination is triggered between the loxP sites by the Cre enzyme.

[0614] Second, to generate the inducer mice, the Cre gene is introduced in the mouse genome under the control of a tissue-specific promoter, for example under the control of the PSA (prostate specific antigen) promoter.

[0615] Finally, tissue-specific inactivation of the PG1 gene is induced by generating mice containing the Cre transgene that are homozygous for the recombinant PG1 allele.

Gene Therapy

[0616] The present invention also comprises the use of the PG1 genomic DNA sequence of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, or nucleic acid encoding a mutant PG1 protein responsible for a detectable phenotype in gene therapy strategies, including antisense and triple helix strategies as described in Examples 19 and 20, below. In antisense approaches, nucleic acid sequences complementary to an mRNA are hybridized to the mRNA intracellularly, thereby blocking the expression of the protein encoded by the mRNA. The antisense sequences may prevent gene expression through a variety of mechanisms. For example, the antisense sequences may inhibit the ability of ribosomes to translate the mRNA. Alternatively, the antisense sequences may block transport of the mRNA from the nucleus to the cytoplasm, thereby limiting the amount of mRNA available for translation. Another mechanism through which antisense sequences may inhibit gene expression is by interfering with mRNA splicing. In yet another strategy, the antisense nucleic acid is incorporated in a ribozyme capable of specifically cleaving the target mRNA.

EXAMPLE 19 Preparation and Use of Antisense Oligonucleotides

[0617] The antisense nucleic acid molecules to be used in gene therapy may be either DNA or RNA sequences. They may comprise a sequence complementary to the sequence of the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, or a nucleic acid encoding a PG1 protein responsible for a detectable phenotype. The antisense nucleic acids should have a length and melting temperature sufficient to permit formation of an intracellular duplex having sufficient stability to inhibit the expression of the PG1 mRNA in the duplex. Strategies for designing antisense nucleic acids suitable for use in gene therapy are disclosed in Green et al., Ann. Rev. Biochem. 55:569-597 (1986) and Izant and Weintraub, Cell 36:1007-1015 (1984).

[0618] In some strategies, antisense molecules are obtained by reversing the orientation of the PG1 coding region with respect to a promoter so as to transcribe the opposite strand from that which is normally transcribed in the cell. The antisense molecules may be transcribed using in vitro transcription systems such as those which employ T7 or SP6 polymerase to generate the transcript. Another approach involves transcription of PG1 antisense nucleic acids in vivo by operably linking DNA containing the antisense sequence to a promoter in an expression vector.

[0619] Alternatively, oligonucleotides which are complementary to the strand of the PG1 gene normally transcribed in the cell may be synthesized in vitro. Thus, the antisense PG1 nucleic acids are complementary to the PG1 mRNA and are capable of hybridizing to the mRNA to create a duplex. In some embodiments, the PG1 antisense sequences may contain modified sugar phosphate backbones to increase stability and make them less sensitive to RNase activity. Examples of modifications suitable for use in antisense strategies are described by Rossi et al., Pharmacol. Ther. 50(2):245-254, (1991).
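Because an antisense oligonucleotide must be complementary to the mRNA, its sequence, read 5′ to 3′, is the reverse complement of the corresponding stretch of the cDNA (sense) sequence. The following is a minimal sketch of that relationship only; the function name `antisense_dna` and the ten-base input are hypothetical and are not taken from any PG1 sequence.

```python
# Translation table for DNA base pairing (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def antisense_dna(sense: str) -> str:
    """Return the reverse complement of a sense-strand DNA segment.

    Both input and output are written 5' to 3'.
    """
    return sense.translate(COMPLEMENT)[::-1]


# Hypothetical target region, for illustration only.
example_sense = "ATGGCCTTAG"
print(antisense_dna(example_sense))  # CTAAGGCCAT
```

A practical design would additionally screen candidate oligonucleotides for length and melting temperature, as the text above requires; this sketch does not attempt that.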

[0620] Various types of antisense oligonucleotides complementary to the sequence of the PG1 genomic DNA of SEQ ID NO: 179, the PG1 cDNA of SEQ ID NO: 3, or a nucleic acid encoding a PG1 protein responsible for a detectable phenotype may be used. In one preferred embodiment, stable and semi-stable antisense oligonucleotides as described in International Application No. PCT WO94/23026 are used to inhibit the expression of the PG1 gene. In these molecules, the 3′ end or both the 3′ and 5′ ends are engaged in intramolecular hydrogen bonding between complementary base pairs. These molecules are better able to withstand exonuclease attacks and exhibit increased stability compared to conventional antisense oligonucleotides.

[0621] In another preferred embodiment, the antisense oligodeoxynucleotides described in International Application No. WO 95/04141 are used to inhibit expression of the PG1 gene.

[0622] In yet another preferred embodiment, the covalently cross-linked antisense oligonucleotides described in International Application No. WO 96/31523 are used to inhibit expression of the PG1 gene. These double- or single-stranded oligonucleotides comprise one or more, respectively, inter- or intra-oligonucleotide covalent cross-linkages, wherein the linkage consists of an amide bond between a primary amine group of one strand and a carboxyl group of the other strand or of the same strand, respectively, the primary amine group being directly substituted in the 2′ position of the strand nucleotide monosaccharide ring, and the carboxyl group being carried by an aliphatic spacer group substituted on a nucleotide or nucleotide analog of the other strand or the same strand, respectively.

[0623] The antisense oligodeoxynucleotides and oligonucleotides disclosed in International Application No. WO 92/18522 may also be used to inhibit the expression of the PG1 gene. These molecules are stable to degradation and contain at least one transcription control recognition sequence which binds to control proteins and are effective as decoys therefor. These molecules may contain “hairpin” structures, “dumbbell” structures, “modified dumbbell” structures, “cross-linked” decoy structures and “loop” structures.

[0624] In another preferred embodiment, the cyclic double-stranded oligonucleotides described in European Patent Application No. 0 572 287 A2 are used to inhibit the expression of the PG1 gene. These ligated oligonucleotide “dumbbells” contain the binding site for a transcription factor which binds to the PG1 promoter and inhibit expression of the gene under control of the transcription factor by sequestering the factor.

[0625] Use of the closed antisense oligonucleotides disclosed in International Application No. WO 92/19732 is also contemplated. Because these molecules have no free ends, they are more resistant to degradation by exonucleases than are conventional oligonucleotides. These oligonucleotides may be multifunctional, interacting with several regions which are not adjacent to the target mRNA.

[0626] The appropriate level of antisense nucleic acids required to inhibit PG1 gene expression may be determined using in vitro expression analysis. The antisense molecule may be introduced into the cells by diffusion, injection, infection or transfection using procedures known in the art. For example, the antisense nucleic acids can be introduced into the body as a bare or naked oligonucleotide, an oligonucleotide encapsulated in lipid, an oligonucleotide sequence encapsidated by viral protein, or as an oligonucleotide operably linked to a promoter contained in an expression vector. The expression vector may be any of a variety of expression vectors known in the art, including retroviral or viral vectors, vectors capable of extrachromosomal replication, or integrating vectors. The vectors may be DNA or RNA.

[0627] The PG1 antisense molecules are introduced onto cell samples at a number of different concentrations, preferably between 1×10⁻¹⁰ M and 1×10⁻⁴ M. Once the minimum concentration that can adequately control gene expression is identified, the optimized dose is translated into a dosage suitable for use in vivo. For example, an inhibiting concentration in culture of 1×10⁻⁷ M translates into a dose of approximately 0.6 mg/kg bodyweight. Levels of oligonucleotide approaching 100 mg/kg bodyweight or higher may be possible after testing the toxicity of the oligonucleotide in laboratory animals. It is additionally contemplated that cells from the vertebrate are removed, treated with the antisense oligonucleotide, and reintroduced into the vertebrate.
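As a rough illustration of the titration described above, the following Python sketch generates a log-spaced series of test concentrations across the stated range. The function name `titration_series` and the choice of one point per decade are assumptions made for illustration only; the specification does not state how the concentrations are spaced, and the translation of a culture concentration into an in vivo dose is not modeled here.

```python
import math

def titration_series(low_molar=1e-10, high_molar=1e-4, points_per_decade=1):
    """Return log-spaced molar concentrations from low_molar to high_molar inclusive."""
    # Number of whole decades spanned by the range (6 for 1e-10 M to 1e-4 M).
    decades = round(math.log10(high_molar / low_molar))
    steps = decades * points_per_decade
    return [low_molar * 10 ** (i / points_per_decade) for i in range(steps + 1)]

if __name__ == "__main__":
    for conc in titration_series():
        print(f"{conc:.0e} M")
```

A finer series can be requested with `points_per_decade=2` or more once the approximate inhibitory range is known.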

[0628] It is further contemplated that the PG1 antisense oligonucleotide sequence is incorporated into a ribozyme sequence to enable the antisense to specifically bind and cleave its target mRNA. For technical applications of ribozyme and antisense oligonucleotides see Rossi et al., supra.

[0629] In a preferred application of this invention, antibody-mediated tests such as RIAs and ELISAs, functional assays, or radiolabeling are used to determine the effectiveness of antisense inhibition on PG1 expression.

[0630] The PG1 cDNA, the PG1 genomic DNA, and the PG1 alleles of the present invention may also be used in gene therapy approaches based on intracellular triple helix formation. Triple helix oligonucleotides are used to inhibit transcription from a genome. They are particularly useful for studying alterations in cell activity as it is associated with a particular gene. The PG1 cDNA, PG1 genomic DNA, or PG1 allele of the present invention or, more preferably, a portion of those sequences, can be used to inhibit gene expression in individuals suffering from prostate cancer or another detectable phenotype or individuals at risk for developing prostate cancer or another detectable phenotype at a later date as a result of their PG1 genotype. Similarly, a portion of the PG1 cDNA, the PG1 genomic DNA, or the PG1 alleles can be used to study the effect of inhibiting PG1 transcription within a cell. Traditionally, homopurine sequences were considered the most useful for triple helix strategies, such as those described in Example 20, below. However, homopyrimidine sequences can also inhibit gene expression. Such homopyrimidine oligonucleotides bind to the major groove at homopurine:homopyrimidine sequences. Thus, both types of sequences from the PG1 cDNA, the PG1 genomic DNA, and the PG1 alleles are contemplated within the scope of this invention.

Example 20

[0631] The sequences of the PG1 cDNA, the PG1 genomic DNA, and the PG1 alleles are scanned to identify 10-mer to 20-mer homopyrimidine or homopurine stretches which could be used in triple-helix based strategies for inhibiting PG1 expression. Following identification of candidate homopyrimidine or homopurine stretches, their efficiency in inhibiting PG1 expression is assessed by introducing varying amounts of oligonucleotides containing the candidate sequences into tissue culture cells which express the PG1 gene. The oligonucleotides may be prepared on an oligonucleotide synthesizer or they may be purchased commercially from a company specializing in custom oligonucleotide synthesis, such as GENSET, Paris, France.
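The scanning step described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used: the function name `find_homo_stretches` is hypothetical, the sequence is assumed to be a plain DNA string, and the function reports maximal runs of at least 10 bases, so a run longer than 20 bases would still need to be trimmed to a 10-mer to 20-mer candidate.

```python
import re

def find_homo_stretches(seq, min_len=10):
    """Return (start, end, kind, run) for each maximal homopurine or
    homopyrimidine run of at least min_len bases, ordered by position."""
    seq = seq.upper()
    hits = []
    # Purines are A and G; pyrimidines are C and T.
    for kind, base_class in (("homopurine", "[AG]"), ("homopyrimidine", "[CT]")):
        pattern = f"{base_class}{{{min_len},}}"  # e.g. "[AG]{10,}"
        for match in re.finditer(pattern, seq):
            hits.append((match.start(), match.end(), kind, match.group()))
    return sorted(hits)

if __name__ == "__main__":
    # Toy sequence, not taken from any PG1 sequence listing.
    for hit in find_homo_stretches("GATCAAGGAGAAGGCTTCCTCTTCGA"):
        print(hit)
```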

[0632] The oligonucleotides may be introduced into the cells using a variety of methods known to those skilled in the art, including but not limited to calcium phosphate precipitation, DEAE-Dextran, electroporation, liposome-mediated transfection or native uptake.

[0633] Treated cells are monitored for altered cell function or reduced PG1 expression using techniques such as Northern blotting, RNase protection assays, or PCR based strategies to monitor the transcription levels of the PG1 gene in cells which have been treated with the oligonucleotide.

[0634] The oligonucleotides which are effective in inhibiting gene expression in tissue culture cells may then be introduced in vivo using the techniques described above and in Example 19 at a dosage calculated based on the in vitro results, as described in Example 19.

[0635] In some embodiments, the natural (beta) anomers of the oligonucleotide units can be replaced with alpha anomers to render the oligonucleotide more resistant to nucleases. Further, an intercalating agent such as ethidium bromide, or the like, can be attached to the 3′ end of the alpha oligonucleotide to stabilize the triple helix. For information on the generation of oligonucleotides suitable for triple helix formation see Griffin et al. (Science 245:967-971 (1989)).

[0636] Alternatively, the PG1 cDNA, the PG1 genomic DNA, and the PG1 alleles of the present invention may be used in gene therapy approaches in which expression of the PG1 protein is beneficial, as described in Example 21 below.

Example 21

[0637] The PG1 cDNA, the PG1 genomic DNA, and the PG1 alleles of the present invention may also be used to express the PG1 protein or a portion thereof in a host organism to produce a beneficial effect. In such procedures, the PG1 protein may be transiently expressed in the host organism or stably expressed in the host organism. The expressed PG1 protein may be used to treat conditions resulting from a lack of PG1 expression or conditions in which augmentation of existing levels of PG1 expression is beneficial.

[0638] A nucleic acid encoding the PG1 proteins of SEQ ID NO: 4, SEQ ID NO: 5, or a PG1 allele is introduced into the host organism. The nucleic acid may be introduced into the host organism using a variety of techniques known to those of skill in the art. For example, the nucleic acid may be injected into the host organism as naked DNA such that the encoded PG1 protein is expressed in the host organism, thereby producing a beneficial effect.

[0639] Alternatively, the nucleic acid encoding the PG1 proteins of SEQ ID NO: 4, SEQ ID NO: 5, or a PG1 allele may be cloned into an expression vector downstream of a promoter which is active in the host organism. The expression vector may be any of the expression vectors designed for use in gene therapy, including viral or retroviral vectors.

[0640] The expression vector may be directly introduced into the host organism such that the PG1 protein is expressed in the host organism to produce a beneficial effect. In another approach, the expression vector may be introduced into cells in vitro. Cells containing the expression vector are thereafter selected and introduced into the host organism, where they express the PG1 protein to produce a beneficial effect.

[0641] IX. ISOLATION OF PG1 cDNA FROM NONHUMAN MAMMALS

[0642] The present invention encompasses mammalian PG1 sequences including genomic and cDNA sequences, as well as polypeptide sequences. The present invention also encompasses the use of PG1 genomic and cDNA sequences of the invention, including SEQ ID NOs: 179, 3, 182, and 183, in methods of isolating and characterizing PG1 nucleotide sequences derived from nonhuman mammals, in addition to sequences derived from human sequences. The human and mouse PG1 nucleic acid sequences of the invention can be used to construct primers and probes for amplifying and identifying PG1 genes in other nonhuman animals, particularly mammals. The primers and probes used to identify nonhuman PG1 sequences may be selected and used for the isolation of nonhuman PG1 utilizing the same techniques described above in Examples 4, 5, 6, 12 and 13.

[0643] In addition, sequence analysis of other homologous proteins may be used to optimize the sequences of these primers and probes. As described above in the Analysis of the PG1 Protein Sequence, three boxes of homology were identified in the structure of the PG1 protein product when compared to proteins from a diverse range of organisms. See FIG. 9. Using the assumption that the nucleotide sequences for these homologous proteins also show a high degree of homology, it is possible to construct primers that are specific for the PCR amplification of PG1 cDNA in nonhuman mammals.

Example 22

[0644] The primers BOXIed: AATCATCAAAGCACAGTTGACTGGAT (SEQ ID NO: 77) and BOXIIIer: ATAAACCACCGTAACATCATAAATTGCATCTAA (SEQ ID NO: 78)

[0645] were designed as PCR primers from the human PG1 sequences after comparison with the sequence homologies of FIG. 9. The BOXIed (SEQ ID NO: 77) and BOXIIIer (SEQ ID NO: 78) primers were used to amplify a mouse PG1 cDNA sequence from mouse liver marathon-ready cDNA (Clontech) under the conditions described above in Example 4. This PCR reaction yielded a product of approximately 400 base pairs, the boxI-boxIII fragment, which was subjected to automated dideoxy terminator sequencing and electrophoresed on ABI 377 sequencers as described above. Sequence analysis confirmed very high homology to human PG1 both at the nucleic acid and protein levels.

[0646] Primers were designed for RACE analysis using the 400 base pair boxI-boxIII fragment. Further sequence information was obtained using 5′ and 3′ RACE reactions on mouse liver marathon cDNA using two sets of these nested PCR primers: moPG1RACE5.350: AATCAAAAGCAACGTGAGTGGC (SEQ ID NO: 94) and moPG1RACE5.276: GCAAATGCCTGACTGGCTGA (SEQ ID NO: 93) for the 5′ RACE reaction and moPG1RACE3.18: CTGCCAGACAGGATGCCCTA (SEQ ID NO: 90) and moPG1RACE3.63: ACAAGTTAAAATGGCTTCCGCTG (SEQ ID NO: 91) for the 3′ RACE reaction.

[0647] The PCR products of the RACE reactions were sequenced by primer walking using the following primers: moPGrace3S473: GAGATAAAAG ATAGGTTGCT CA (SEQ ID NO: 79); moPGrace3S526: AAGAAACAAA TTTCCTGGG (SEQ ID NO: 80); moPGrace3S597: TCTTGGGGAG TTTGACTG (SEQ ID NO: 81); moPGrace5R323: GACCCCGGTG TAGTTCTC (SEQ ID NO: 82); moPGrace5R372: CAGTAAAGCC GGTCGTC (SEQ ID NO: 83); moPGrace5R444: CAGGCCAGCA GGTAGGT (SEQ ID NO: 84); moPGrace5R492: AGCAGGTAGC GCATAGAGT (SEQ ID NO: 85).

[0648] Again a high degree of homology between the mouse sequence obtained from the primer walking and the human PG1 sequence was observed. An additional pair of nested primers was designed and utilized to further extend the 3′ mouse PG1 sequence in yet another RACE reaction, moPG3RACE2: TGGGCACCTG GTTGTATGGA (SEQ ID NO: 95) and moPG3RACE2n: TCCTTGGCTG CCTGTGGTTT (SEQ ID NO: 96).

[0649] The PCR product of this final RACE reaction was also sequenced by primer walking using the following primers: moPG1RACE3R94: CAAATGCATG TTGGCTGT (SEQ ID NO: 92); moPG3RACES20: GATGGCTACA CATTGTATCA C (SEQ ID NO: 97); moPG3RACES5: TCCTGAATTA AATAAGGAGT TTTC (SEQ ID NO: 98); moPG3RACES90: GTTTGTTATT AAAGCATAAG CAAG (SEQ ID NO: 99).

[0650] The overlap in the 5′ RACE, boxI-boxIII, and 3′ RACE fragments allowed a single contiguous coding sequence for the mouse PG1 ortholog to be generated by alignment of the three fragments. Primers were chosen from near the 5′ and 3′ ends of this predicted contiguous sequence (contig) in order to confirm the existence of such a transcript. PCR amplification was performed again on mouse liver marathon-ready cDNA (Clontech) with the chosen primers, moPG15: TGGCGAGCCGAGAGGATG (SEQ ID NO: 87) and moPG13LR2: GGAAACAATGTGATACAATGTGTAGCC (SEQ ID NO: 86)

[0651] under the PCR conditions described above in Example 4. The resulting PCR product was a roughly 1.2 kb DNA molecule and was shown to have an identical sequence to that of the deduced contig. Finally, modified versions of the moPG15 and moPG13LR2 primers with the addition of EcoRI and BamHI sites, moPG15EcoRI: CGTGAATTCTGGCGAGCCGAGAGGATG (SEQ ID NO: 89) and moPG15Bam1: CGTGGATCCGGAAACAATGTGATACAATGTGTAGCC (SEQ ID NO: 88)

[0652] were used to obtain a PCR product that could be cloned into a pSKBluescript plasmid (Stratagene) cleaved with EcoRI and BamHI restriction enzymes. The mouse PG1 cDNA in the resulting construct was subjected to automated dideoxy terminator sequencing and electrophoresed on ABI 377 sequencers as described above. The sequence for mouse PG1 cDNA is reported in SEQ ID NO: 72, and the deduced amino acid sequence corresponding to the cDNA is reported in SEQ ID NO: 74.

Example 23

[0653] A mouse BAC library was constructed by the cloning of BamHI partially digested DNA of pluripotent embryonic stem cells, cell line ES-E14TG2a (ATCC CRL-1821), into pBeloBAC11 vector plasmid. Approximately fifty-six thousand clones with an average insert size of 120 kb were picked individually and pooled for PCR screening as described above for human BAC library screening. These pools were screened with STS g34292 derived from the region of the mouse PG1 transcript corresponding to exon 6 of the human gene. The upstream and downstream primers defining this STS are: upstream amplification primer for g34292: ATTAAAACAC GTACTGACAC CA (SEQ ID NO: 75), and downstream amplification primer for g34292: AGTCATGGAT GGTGGATTT (SEQ ID NO: 76). BAC C0281H06 tested positive for hybridizing to g34292. This BAC was isolated and sequenced by sub-cloning into pGenDel sequencing vector. The resulting partial genomic sequence for mouse PG1 is reported in SEQ ID NO: 73. This process was repeated and the resulting partial genomic sequences for mouse PG1 are reported in SEQ ID NOs: 182 and 183.

[0654] Other mammalian PG1 cDNA and genomic sequences can be isolated by the methods of the present invention. PG1 genes in mammalian species have a region of at least 100, preferably 200, more preferably 500 nucleotides in each mammal's most abundant transcription species which has at least 75%, preferably 85%, more preferably 95% sequence homology to the most abundant human or mouse cDNA species (SEQ ID NO: 3). PG1 proteins in mammalian species have a region of at least 40, preferably 90, more preferably 160 amino acids in the deduced amino acid sequence of the most abundant PG1 transcription species which has at least 75%, preferably 85%, more preferably 95% sequence homology to the deduced amino acid sequence of the most abundant human or mouse translation species (SEQ ID NO: 4 or 74).

[0655] X. METHODS FOR GENOTYPING AN INDIVIDUAL FOR BIALLELIC MARKERS

[0656] Methods are provided to genotype a biological sample for one or more biallelic markers of the present invention, all of which may be performed in vitro. Such methods of genotyping comprise determining the identity of a nucleotide at a PG1-related biallelic marker by any method known in the art. These methods find use in genotyping case-control populations in association studies as well as individuals in the context of detection of alleles of biallelic markers which are known to be associated with a given trait, in which case both copies of the biallelic marker present in the individual's genome are determined so that an individual may be classified as homozygous or heterozygous for a particular allele.
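The classification step at the end of the preceding paragraph can be sketched as below, assuming the nucleotide on each of the two copies of the marker has already been determined. The function name `classify_genotype` and the returned labels are illustrative assumptions, not terms defined in this specification.

```python
def classify_genotype(copy_1, copy_2, allele):
    """Classify an individual for one allele of a biallelic marker,
    given the nucleotide found on each of the two genomic copies."""
    copies = [copy_1.upper(), copy_2.upper()].count(allele.upper())
    if copies == 2:
        return "homozygous for the allele"
    if copies == 1:
        return "heterozygous"
    # With only two possible alleles, zero copies means both carry the other one.
    return "homozygous for the alternative allele"

if __name__ == "__main__":
    print(classify_genotype("A", "G", "A"))
```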

[0657] These genotyping methods can be performed on nucleic acid samples derived from a single individual or on pooled DNA samples.

[0658] Genotyping can be performed using methods similar to those described above for the identification of the biallelic markers, or using other genotyping methods such as those further described below. In preferred embodiments, the comparison of sequences of amplified genomic fragments from different individuals is used to identify new biallelic markers whereas microsequencing is used for genotyping known biallelic markers in diagnostic and association study applications.

[0659] X.A. Source of DNA for genotyping

[0660] Any source of nucleic acids, in purified or non-purified form, can be utilized as the starting nucleic acid, provided it contains or is suspected of containing the specific nucleic acid sequence desired. DNA or RNA may be extracted from cells, tissues, or body fluids. As for the source of genomic DNA to be subjected to analysis, any test sample can be foreseen without any particular limitation. These test samples include biological samples, which can be tested by the methods of the present invention described herein, and include human and animal body fluids such as whole blood, serum, plasma, cerebrospinal fluid, urine, lymph fluids, and various external secretions of the respiratory, intestinal and genitourinary tracts, tears, saliva, milk, white blood cells, myelomas and the like; biological fluids such as cell culture supernatants; fixed tissue specimens including tumor and non-tumor tissue and lymph node tissues; bone marrow aspirates and fixed cell specimens. The preferred source of genomic DNA used in the present invention is the peripheral venous blood of each donor. Techniques to prepare genomic DNA from biological samples are well known to the skilled technician. While nucleic acids for use in the genotyping methods of the invention can be derived from any mammalian source, the test subjects and individuals from which nucleic acid samples are taken are generally understood to be human.

[0661] X.B. Amplification Of DNA Fragments Comprising Biallelic Markers

[0662] Methods and polynucleotides are provided to amplify a segment of nucleotides comprising one or more biallelic markers of the present invention. It will be appreciated that amplification of DNA fragments comprising biallelic markers may be used in various methods and for various purposes and is not restricted to genotyping. Nevertheless, many genotyping methods, although not all, require the previous amplification of the DNA region carrying the biallelic marker of interest. Such methods specifically increase the concentration or total number of sequences that span the biallelic marker or include that site and sequences located either distal or proximal to it. Diagnostic assays may also rely on amplification of DNA segments carrying a biallelic marker of the present invention.

[0663] Amplification of DNA may be achieved by any method known in the art, including the established PCR (polymerase chain reaction) method, developments thereof, or alternatives. Amplification methods which can be utilized herein include but are not limited to Ligase Chain Reaction (LCR) as described in EP A 320 308 and EP A 439 182, Gap LCR (Wolcott, M. J., Clin. Microbiol. Rev. 5:370-386), the so-called “NASBA” or “3SR” technique described in Guatelli J. C. et al. (Proc. Natl. Acad. Sci. USA 87:1874-1878, 1990) and in Compton J. (Nature 350:91-92, 1991), Q-beta amplification as described in European Patent Application no 4544610, strand displacement amplification as described in Walker et al. (Clin. Chem. 42:9-13, 1996) and EP A 684 315, and target mediated amplification as described in PCT Publication WO 9322461.

[0664] LCR and Gap LCR are exponential amplification techniques; both depend on DNA ligase to join adjacent primers annealed to a DNA molecule. In Ligase Chain Reaction (LCR), probe pairs are used which include two primary (first and second) and two secondary (third and fourth) probes, all of which are employed in molar excess to target. The first probe hybridizes to a first segment of the target strand and the second probe hybridizes to a second segment of the target strand, the first and second segments being contiguous so that the primary probes abut one another in 5′ phosphate-3′ hydroxyl relationship, and so that a ligase can covalently fuse or ligate the two probes into a fused product. In addition, a third (secondary) probe can hybridize to a portion of the first probe and a fourth (secondary) probe can hybridize to a portion of the second probe in a similar abutting fashion. Of course, if the target is initially double stranded, the secondary probes also will hybridize to the target complement in the first instance. Once the ligated strand of primary probes is separated from the target strand, it will hybridize with the third and fourth probes which can be ligated to form a complementary, secondary ligated product. It is important to realize that the ligated products are functionally equivalent to either the target or its complement. By repeated cycles of hybridization and ligation, amplification of the target sequence is achieved. A method for multiplex LCR has also been described (WO 9320227). Gap LCR (GLCR) is a version of LCR where the probes are not adjacent but are separated by 2 to 3 bases.

[0665] For amplification of mRNAs, it is within the scope of the present invention to reverse transcribe mRNA into cDNA followed by polymerase chain reaction (RT-PCR); or to use a single enzyme for both steps as described in U.S. Pat. No. 5,322,770; or to use Asymmetric Gap LCR (RT-AGLCR) as described by Marshall R. L. et al. (PCR Methods and Applications 4:80-84, 1994). AGLCR is a modification of GLCR that allows the amplification of RNA.

[0666] Some of these amplification methods are particularly suited for the detection of single nucleotide polymorphisms and allow the simultaneous amplification of a target sequence and the identification of the polymorphic nucleotide, as further described in X.C.

[0667] PCR technology is the preferred amplification technique used in the present invention. A variety of PCR techniques are familiar to those skilled in the art. For a review of PCR technology, see Molecular Cloning to Genetic Engineering, White, B. A. Ed., in Methods in Molecular Biology 67: Humana Press, Totowa (1997) and the publication entitled “PCR Methods and Applications” (1991, Cold Spring Harbor Laboratory Press). In each of these PCR procedures, PCR primers on either side of the nucleic acid sequences to be amplified are added to a suitably prepared nucleic acid sample along with dNTPs and a thermostable polymerase such as Taq polymerase, Pfu polymerase, or Vent polymerase. The nucleic acid in the sample is denatured and the PCR primers are specifically hybridized to complementary nucleic acid sequences in the sample. The hybridized primers are extended. Thereafter, another cycle of denaturation, hybridization, and extension is initiated. The cycles are repeated multiple times to produce an amplified fragment containing the nucleic acid sequence between the primer sites. PCR has further been described in several patents including U.S. Pat. Nos. 4,683,195, 4,683,202 and 4,965,188.

[0668] The identification of biallelic markers as described above allows the design of appropriate oligonucleotides, which can be used as primers to amplify DNA fragments comprising the biallelic markers of the present invention. Amplification can be performed using the primers initially used to discover new biallelic markers which are described herein or any set of primers allowing the amplification of a DNA fragment comprising a biallelic marker of the present invention. Primers can be prepared by any suitable method, for example, direct chemical synthesis by a method such as the phosphotriester method of Narang S. A. et al. (Methods Enzymol. 68:90-98, 1979), the phosphodiester method of Brown E. L. et al. (Methods Enzymol. 68:109-151, 1979), the diethylphosphoramidite method of Beaucage et al. (Tetrahedron Lett. 22:1859-1862, 1981) and the solid support method described in EP 0 707 592.

[0669] In some embodiments the present invention provides primers for amplifying a DNA fragment containing one or more biallelic markers of the present invention. It will be appreciated that the amplification primers listed in the present specification are merely exemplary and that any other set of primers which produces amplification products containing one or more biallelic markers of the present invention may be used.

[0670] The primers are selected to be substantially complementary to the different strands of each specific sequence to be amplified. The length of the primers of the present invention can range from 8 to 100 nucleotides, preferably from 8 to 50, 8 to 30 or more preferably 8 to 25 nucleotides. Shorter primers tend to lack specificity for a target nucleic acid sequence and generally require cooler temperatures to form sufficiently stable hybrid complexes with the template. Longer primers are expensive to produce and can sometimes self-hybridize to form hairpin structures. The formation of stable hybrids depends on the melting temperature (Tm) of the DNA. The Tm depends on the length of the primer, the ionic strength of the solution and the G+C content. The higher the G+C content of the primer, the higher is the melting temperature because G:C pairs are held by three H bonds whereas A:T pairs have only two. The G+C content of the amplification primers of the present invention preferably ranges between 10 and 75%, more preferably between 35 and 60%, and most preferably between 40 and 55%. The appropriate length for primers under a particular set of assay conditions may be empirically determined by one of skill in the art.
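The dependence of Tm on length and G+C content can be illustrated with a short Python sketch. The estimate used here is the Wallace rule (2 °C per A or T, 4 °C per G or C), a common rule of thumb for short oligonucleotides that this specification does not itself name and that ignores ionic strength; the function names are hypothetical. The example primer is moPG1RACE5.276 (SEQ ID NO: 93) from Example 22.

```python
def gc_percent(primer):
    """Return the G+C content of a primer as a percentage of its length."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100 * gc / len(primer)

def wallace_tm(primer):
    """Estimate Tm in degrees Celsius with the Wallace rule: 2*(A+T) + 4*(G+C).
    This is a rough approximation intended for short primers only."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

if __name__ == "__main__":
    primer = "GCAAATGCCTGACTGGCTGA"  # moPG1RACE5.276, SEQ ID NO: 93
    print(len(primer), gc_percent(primer), wallace_tm(primer))
```

Because each G:C pair contributes twice as much as an A:T pair in this estimate, two primers of equal length can differ markedly in Tm, which is why both length and G+C content are constrained above.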

[0671] The spacing of the primers determines the length of the segment to be amplified. In the context of the present invention amplified segments carrying biallelic markers can range in size from at least about 25 bp to 35 kbp. Amplification fragments from 25-3000 bp are typical, fragments from 50-1000 bp are preferred and fragments from 100-600 bp are highly preferred. It will be appreciated that amplification primers for the biallelic markers may be any sequence which allows the specific amplification of any DNA fragment carrying the markers. Amplification primers may be labeled or immobilized on a solid support as described in Section II.
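To show how primer spacing fixes the amplicon length, the following sketch locates a primer pair on a template by exact string matching and returns the segment they delimit. It is a simplified in silico model under stated assumptions: primers must match perfectly, only the first binding site of each is considered, and the names `reverse_complement` and `amplicon` and the toy sequences are illustrative rather than taken from the sequence listing.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def amplicon(template, forward, reverse):
    """Return the segment delimited by the two primers (primers included),
    or None if either primer has no exact binding site on the template."""
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is what appears in the template as written.
    site = reverse_complement(reverse)
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(site)]

if __name__ == "__main__":
    product = amplicon("GGGGACCTGACCCCCCCCTTGCAGGGGG", "ACCTGA", "CTGCAA")
    print(product, len(product))
```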

[0672] X.C. Methods of Genotyping DNA samples for Biallelic Markers

[0673] Any method known in the art can be used to identify the nucleotide present at a biallelic marker site. Since the biallelic marker allele to be detected has been identified and specified in the present invention, detection will prove routine for one of ordinary skill in the art by employing any of a number of techniques. Many genotyping methods require the previous amplification of the DNA region carrying the biallelic marker of interest. While the amplification of target or signal is often preferred at present, ultrasensitive detection methods which do not require amplification are also encompassed by the present genotyping methods. Methods well-known to those skilled in the art that can be used to detect biallelic polymorphisms include methods such as conventional dot blot analyses, single strand conformational polymorphism analysis (SSCP) described by Orita et al. (Proc. Natl. Acad. Sci. U.S.A 86:2766-2770, 1989), denaturing gradient gel electrophoresis (DGGE), heteroduplex analysis, mismatch cleavage detection, and other conventional techniques as described in Sheffield, V. C. et al. (Proc. Natl. Acad. Sci. USA 49:699-706, 1991), White et al. (Genomics 12:301-306, 1992), Grompe, M. et al. (Proc. Natl. Acad. Sci. USA 86:5855-5892, 1989) and Grompe, M. (Nature Genetics 5:111-117, 1993). Another method for determining the identity of the nucleotide present at a particular polymorphic site employs a specialized exonuclease-resistant nucleotide derivative as described in U.S. Pat. No. 4,656,127.

[0674] Preferred methods involve directly determining the identity of the nucleotide present at a biallelic marker site by sequencing assay, allele-specific amplification assay, or hybridization assay. The following is a description of some preferred methods. A highly preferred method is the microsequencing technique. The term “sequencing assay” is used herein to refer to polymerase extension of duplex primer/template complexes and includes both traditional sequencing and microsequencing.

[0675] 1) Sequencing assays

[0676] The nucleotide present at a polymorphic site can be determined by sequencing methods. In a preferred embodiment, DNA samples are subjected to PCR amplification before sequencing as described above. Methods for sequencing DNA using either the dideoxy-mediated method (Sanger method) or the Maxam-Gilbert method are widely known to those of ordinary skill in the art. Such methods are for example disclosed in Maniatis et al. (Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press, Second Edition, 1989). Alternative approaches include hybridization to high-density DNA probe arrays as described in Chee et al. (Science 274, 610, 1996).

[0677] Preferably, the amplified DNA is subjected to automated dideoxy terminator sequencing reactions using a dye-primer cycle sequencing protocol. The products of the sequencing reactions are run on sequencing gels and the sequences are determined using gel image analysis.

[0678] The polymorphism detection in a pooled sample is based on the presence of superimposed peaks in the electrophoresis pattern resulting from different bases occurring at the same position. Because each dideoxy terminator is labeled with a different fluorescent molecule, the two peaks corresponding to a biallelic site present distinct colors corresponding to two different nucleotides at the same position on the sequence. However, the presence of two peaks can be an artifact due to background noise. To exclude such an artifact, the two DNA strands are sequenced and a comparison between the peaks is carried out. In order to be registered as a polymorphic sequence, the polymorphism has to be detected on both strands.
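The both-strand rule described above can be sketched as follows. This is an illustrative model only: peak heights at one position are assumed to be available as a dictionary per strand, the reverse-strand peaks are assumed to have already been converted to forward-strand bases, and the 0.2 ratio threshold and function names are hypothetical values chosen for the example rather than parameters given in this specification.

```python
def top_two_bases(peaks, min_ratio):
    """Return the two strongest bases as a set if the second peak reaches
    min_ratio of the first; otherwise return None (single clean peak)."""
    ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
    (base_1, height_1), (base_2, height_2) = ranked[0], ranked[1]
    if height_1 > 0 and height_2 / height_1 >= min_ratio:
        return frozenset((base_1, base_2))
    return None

def is_polymorphic(forward_peaks, reverse_peaks, min_ratio=0.2):
    """Register a site only if the same two superimposed bases are seen on both strands."""
    forward_call = top_two_bases(forward_peaks, min_ratio)
    reverse_call = top_two_bases(reverse_peaks, min_ratio)
    return forward_call is not None and forward_call == reverse_call

if __name__ == "__main__":
    forward = {"A": 100, "G": 60, "C": 2, "T": 1}
    reverse = {"A": 90, "G": 50, "C": 3, "T": 0}
    print(is_polymorphic(forward, reverse))
```

A second peak that appears on one strand only is treated as background noise and the site is not registered.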

[0679] The above procedure permits those amplification products which contain biallelic markers to be identified. The detection limit for the frequency of biallelic polymorphisms detected by sequencing pools of 100 individuals is approximately 0.1 for the minor allele, as verified by sequencing pools of known allelic frequencies.

[0680] 2) Microsequencing assays

[0681] In microsequencing methods, the nucleotide at a polymorphic site in a target DNA is detected by a single nucleotide primer extension reaction. This method involves appropriate microsequencing primers which hybridize just upstream of the polymorphic base of interest in the target nucleic acid. A polymerase is used to specifically extend the 3′ end of the primer with one single ddNTP (chain terminator) complementary to the nucleotide at the polymorphic site. Next, the identity of the incorporated nucleotide is determined in any suitable way.
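The single nucleotide extension described above can be modeled in a few lines of Python. The sketch assumes an exact primer match and a template strand written 5′ to 3′; the function name `incorporated_ddntp` and the toy template are illustrative assumptions, not sequences from this specification.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def incorporated_ddntp(template, primer):
    """Return the base of the ddNTP added to the primer's 3' end, or None
    if the primer does not anneal or no template base lies beyond it."""
    template = template.upper()
    # The primer pairs with its reverse complement on the template strand.
    pos = template.find(reverse_complement(primer))
    if pos <= 0:
        return None
    # The primer's 3' end faces the template base just 5' of the annealing
    # site; the polymerase adds the complement of that base.
    return template[pos - 1].translate(COMPLEMENT)

if __name__ == "__main__":
    primer = "ATGGTCAA"
    for template in ("GGACTTGACCAT", "GGATTTGACCAT"):  # C allele, T allele
        print(template, incorporated_ddntp(template, primer))
```

The two toy templates differ only at the polymorphic position, so the same primer incorporates a different terminator for each allele.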

[0682] Typically, microsequencing reactions are carried out using fluorescent ddNTPs and the extended microsequencing primers are analyzed by electrophoresis on ABI 377 sequencing machines to determine the identity of the incorporated nucleotide as described in EP 412 883. Alternatively, capillary electrophoresis can be used in order to process a higher number of assays simultaneously.

[0683] Different approaches can be used to detect the nucleotide added to the microsequencing primer. A homogeneous phase detection method based on fluorescence resonance energy transfer has been described by Chen and Kwok (Nucleic Acids Research 25:347-353, 1997) and Chen et al. (Proc. Natl. Acad. Sci. USA 94/20 10756-10761, 1997). In this method, amplified genomic DNA fragments containing polymorphic sites are incubated with a 5′-fluorescein-labeled primer in the presence of allelic dye-labeled dideoxyribonucleoside triphosphates and a modified Taq polymerase. The dye-labeled primer is extended one base by the dye-terminator specific for the allele present on the template. At the end of the genotyping reaction, the fluorescence intensities of the two dyes in the reaction mixture are analyzed directly without separation or purification. All these steps can be performed in the same tube and the fluorescence changes can be monitored in real time. Alternatively, the extended primer may be analyzed by MALDI-TOF Mass Spectrometry. The base at the polymorphic site is identified by the mass added onto the microsequencing primer (see Haff L. A. and Smirnov I. P., Genome Research, 7:378-388, 1997).

[0684] Microsequencing may be achieved by the established microsequencing method or by developments or derivatives thereof. Alternative methods include several solid-phase microsequencing techniques. The basic microsequencing protocol is the same as described previously, except that the method is conducted as a heterogeneous phase assay, in which the primer or the target molecule is immobilized or captured onto a solid support. To simplify the primer separation and the terminal nucleotide addition analysis, oligonucleotides are attached to solid supports or are modified in such ways that permit affinity separation as well as polymerase extension. The 5′ ends and internal nucleotides of synthetic oligonucleotides can be modified in a number of different ways to permit different affinity separation approaches, e.g., biotinylation. If a single affinity group is used on the oligonucleotides, the oligonucleotides can be separated from the incorporated terminator reagent. This eliminates the need for physical or size separation. More than one oligonucleotide can be separated from the terminator reagent and analyzed simultaneously if more than one affinity group is used. This permits the analysis of several nucleic acid species or more nucleic acid sequence information per extension reaction. The affinity group need not be on the priming oligonucleotide but could alternatively be present on the template. For example, immobilization can be carried out via an interaction between biotinylated DNA and streptavidin-coated microtitration wells or avidin-coated polystyrene particles. In the same manner, oligonucleotides or templates may be attached to a solid support in a high-density format. In such solid phase microsequencing reactions, incorporated ddNTPs can be radiolabeled (Syvänen, Clinica Chimica Acta 226:225-236, 1994) or linked to fluorescein (Livak and Hainer, Human Mutation 3:379-385, 1994). The detection of radiolabeled ddNTPs can be achieved through scintillation-based techniques. The detection of fluorescein-linked ddNTPs can be based on the binding of antifluorescein antibody conjugated with alkaline phosphatase, followed by incubation with a chromogenic substrate (such as p-nitrophenyl phosphate). Other possible reporter-detection pairs include: ddNTP linked to dinitrophenyl (DNP) and anti-DNP alkaline phosphatase conjugate (Harju et al., Clin. Chem. 39/11 2282-2287, 1993) or biotinylated ddNTP and horseradish peroxidase-conjugated streptavidin with o-phenylenediamine as a substrate (WO 92/15712). As yet another alternative solid-phase microsequencing procedure, Nyren et al. (Analytical Biochemistry 208:171-175, 1993) described a method relying on the detection of DNA polymerase activity by an enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA).

[0685] Pastinen et al. (Genome Research 7:606-614, 1997) describe a method for multiplex detection of single nucleotide polymorphism in which the solid phase minisequencing principle is applied to an oligonucleotide array format. High-density arrays of DNA probes attached to a solid support (DNA chips) are further described in X.C.5.

[0686] In one aspect the present invention provides polynucleotides and methods to genotype one or more biallelic markers of the present invention by performing a microsequencing assay. It will be appreciated that any primer having a 3′ end immediately adjacent to the polymorphic nucleotide may be used. However, polynucleotides comprising at least 8, 12, 15, 20, 25, or 30 consecutive nucleotides of the sequence immediately adjacent to the biallelic marker and having a 3′ terminus immediately upstream of the corresponding biallelic marker are well suited for determining the identity of a nucleotide at a biallelic marker site.
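The primer geometry described above can be sketched in a few lines of Python. The example below is a minimal illustration, assuming the template is supplied as a 5′ to 3′ string and the marker position is given as a 0-based index; the function name and default length are hypothetical.

```python
def microsequencing_primer(sequence, marker_index, length=20):
    """Return the `length` nucleotides immediately upstream of the marker.

    sequence: strand written 5'->3'; marker_index: 0-based position of
    the polymorphic base. The primer's 3' terminus is the base just
    before the marker, so a one-base extension reads the marker itself.
    """
    if length > marker_index:
        raise ValueError("not enough upstream sequence for this primer length")
    return sequence[marker_index - length:marker_index]
```

For example, with the toy sequence `"AAAACCCCGGGGT"` and the marker at index 12, a primer of length 8 would be `"CCCCGGGG"`.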

[0687] Similarly, it will be appreciated that microsequencing analysis may be performed for any biallelic marker or any combination of biallelic markers of the present invention.

[0688] Mismatch detection assays based on polymerases and ligases

[0689] In one aspect the present invention provides polynucleotides and methods to determine the allele of one or more biallelic markers of the present invention in a biological sample, by mismatch detection assays based on polymerases and/or ligases. These assays are based on the specificity of polymerases and ligases. Polymerization reactions place particularly stringent requirements on correct base pairing of the 3′ end of the amplification primer, and the joining of two oligonucleotides hybridized to a target DNA sequence is quite sensitive to mismatches close to the ligation site, especially at the 3′ end. Methods, primers and various parameters to amplify DNA fragments comprising biallelic markers of the present invention are further described above in X.B.

[0690] Allele specific amplification

[0691] Discrimination between the two alleles of a biallelic marker can also be achieved by allele specific amplification, a selective strategy whereby one of the alleles is amplified without amplification of the other allele. This is accomplished by placing the polymorphic base at the 3′ end of one of the amplification primers. Because extension proceeds from the 3′ end of the primer, a mismatch at or near this position has an inhibitory effect on amplification. Therefore, under appropriate amplification conditions, these primers only direct amplification on their complementary allele. Designing the appropriate allele-specific primer and the corresponding assay conditions is well within the ordinary skill in the art.
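The placement of the polymorphic base at the 3′ end of the primer can be illustrated with a short Python sketch. It assumes a forward primer on the strand on which the marker alleles are written; the function name and arguments are hypothetical, and real primer design would also consider melting temperature and secondary structure.

```python
def allele_specific_primers(upstream, alleles, length=20):
    """Build one forward primer per allele, polymorphic base at the 3' end.

    upstream: sequence immediately 5' of the marker, written 5'->3'.
    alleles: the two alternative bases at the marker, e.g. ("A", "G").
    """
    if length < 2 or len(upstream) < length - 1:
        raise ValueError("need at least length - 1 upstream bases")
    stem = upstream[-(length - 1):]          # shared 5' portion
    return {allele: stem + allele for allele in alleles}
```

Each returned primer differs from the other only in its final 3′ base, which is the position at which a mismatch is expected to inhibit extension.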

[0692] Ligation/amplification based methods

[0693] The “Oligonucleotide Ligation Assay” (OLA) uses two oligonucleotides which are designed to be capable of hybridizing to abutting sequences of a single strand of a target molecule. One of the oligonucleotides is biotinylated, and the other is detectably labeled. If the precise complementary sequence is found in a target molecule, the oligonucleotides will hybridize such that their termini abut, and create a ligation substrate that can be captured and detected. OLA is capable of detecting single nucleotide polymorphisms and may be advantageously combined with PCR as described by Nickerson D. A. et al. (Proc. Natl. Acad. Sci. U.S.A. 87:8923-8927, 1990). In this method, PCR is used to achieve the exponential amplification of target DNA, which is then detected using OLA.

[0694] Other methods which are particularly suited for the detection of single nucleotide polymorphism include LCR (ligase chain reaction) and Gap LCR (GLCR), which are described above in X.B. As mentioned above, LCR uses two pairs of probes to exponentially amplify a specific target. The sequences of each pair of oligonucleotides are selected to permit the pair to hybridize to abutting sequences of the same strand of the target. Such hybridization forms a substrate for a template-dependent ligase. In accordance with the present invention, LCR can be performed with oligonucleotides having the proximal and distal sequences of the same strand of a biallelic marker site. In one embodiment, either oligonucleotide will be designed to include the biallelic marker site. In such an embodiment, the reaction conditions are selected such that the oligonucleotides can be ligated together only if the target molecule either contains or lacks the specific nucleotide that is complementary to the biallelic marker on the oligonucleotide. In an alternative embodiment, the oligonucleotides will not include the biallelic marker, such that when they hybridize to the target molecule, a “gap” is created as described in WO 90/01069. This gap is then “filled” with complementary dNTPs (as mediated by DNA polymerase), or by an additional pair of oligonucleotides. Thus at the end of each cycle, each single strand has a complement capable of serving as a target during the next cycle and exponential allele-specific amplification of the desired sequence is obtained.

[0695] Ligase/Polymerase-mediated Genetic Bit Analysis™ is another method for determining the identity of a nucleotide at a preselected site in a nucleic acid molecule (WO 95/21271). This method involves the incorporation of a nucleoside triphosphate that is complementary to the nucleotide present at the preselected site onto the terminus of a primer molecule, and their subsequent ligation to a second oligonucleotide. The reaction is monitored by detecting a specific label attached to the reaction's solid phase or by detection in solution.

[0696] 2) Hybridization assay methods

[0697] A preferred method of determining the identity of the nucleotide present at a biallelic marker site involves nucleic acid hybridization. The hybridization probes, which can be conveniently used in such reactions, preferably include the probes defined herein. Any hybridization assay may be used, including Southern hybridization, Northern hybridization, dot blot hybridization and solid-phase hybridization (see Sambrook et al., Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989).

[0698] Hybridization refers to the formation of a duplex structure by two single stranded nucleic acids due to complementary base pairing. Hybridization can occur between exactly complementary nucleic acid strands or between nucleic acid strands that contain minor regions of mismatch. Specific probes can be designed that hybridize to one form of a biallelic marker and not to the other and therefore are able to discriminate between different allelic forms. Allele-specific probes are often used in pairs, one member of a pair showing a perfect match to a target sequence containing the original allele and the other showing a perfect match to the target sequence containing the alternative allele. Hybridization conditions should be sufficiently stringent that there is a significant difference in hybridization intensity between alleles, and preferably an essentially binary response, whereby a probe hybridizes to only one of the alleles. Stringent, sequence specific hybridization conditions, under which a probe will hybridize only to the exactly complementary target sequence, are well known in the art (Sambrook et al., Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989). Stringent conditions are sequence dependent and will be different in different circumstances. Generally, stringent conditions are selected to be about 5° C. lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. By way of example and not limitation, procedures using conditions of high stringency are as follows: Prehybridization of filters containing DNA is carried out for 8 h to overnight at 65° C. in buffer composed of 6×SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 0.02% PVP, 0.02% Ficoll, 0.02% BSA, and 500 μg/ml denatured salmon sperm DNA. Filters are hybridized for 48 h at 65° C., the preferred hybridization temperature, in prehybridization mixture containing 100 μg/ml denatured salmon sperm DNA and 5-20×10⁶ cpm of ³²P-labeled probe. Alternatively, the hybridization step can be performed at 65° C. in the presence of SSC buffer, 1×SSC corresponding to 0.15 M NaCl and 0.05 M Na citrate. Subsequently, filter washes can be done at 37° C. for 1 h in a solution containing 2×SSC, 0.01% PVP, 0.01% Ficoll, and 0.01% BSA, followed by a wash in 0.1×SSC at 50° C. for 45 min. Alternatively, filter washes can be performed in a solution containing 2×SSC and 0.1% SDS, or 0.5×SSC and 0.1% SDS, or 0.1×SSC and 0.1% SDS at 68° C. for 15 minute intervals. Following the wash steps, the hybridized probes are detectable by autoradiography. By way of example and not limitation, procedures using conditions of intermediate stringency are as follows: Filters containing DNA are prehybridized, and then hybridized at a temperature of 60° C. in the presence of a 5×SSC buffer and labeled probe. Subsequently, filter washes are performed in a solution containing 2×SSC at 50° C. and the hybridized probes are detectable by autoradiography. Other conditions of high and intermediate stringency which may be used are well known in the art and are cited in Sambrook et al. (Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989) and Ausubel et al. (Current Protocols in Molecular Biology, Green Publishing Associates and Wiley Interscience, N.Y., 1989).
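The rule of thumb that stringent conditions lie about 5° C. below the thermal melting point can be illustrated numerically. The Python sketch below estimates the melting point with the Wallace rule (2° C. per A or T, 4° C. per G or C), a coarse approximation sometimes applied to short oligonucleotides; this estimator is an assumption introduced here for illustration and is not the method of the present disclosure, which does not specify how the melting point is determined. Function names are hypothetical.

```python
def wallace_tm(probe):
    """Rough melting point (deg C) of a short oligonucleotide (Wallace rule)."""
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2.0 * at + 4.0 * gc

def stringent_temperature(probe, offset=5.0):
    """Temperature about `offset` deg C below the estimated melting point."""
    return wallace_tm(probe) - offset
```

For an 18-mer with nine A/T and nine G/C residues, the estimate is 54° C. and the corresponding stringent temperature about 49° C. Actual melting points depend on ionic strength, pH and probe concentration, so such an estimate is only a starting point for optimization.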

[0699] Although such hybridizations can be performed in solution, it is preferred to employ a solid-phase hybridization assay. The target DNA comprising a biallelic marker of the present invention may be amplified prior to the hybridization reaction. The presence of a specific allele in the sample is determined by detecting the presence or the absence of stable hybrid duplexes formed between the probe and the target DNA. The detection of hybrid duplexes can be carried out by a number of methods. Various detection assay formats are well known which utilize detectable labels bound to either the target or the probe to enable detection of the hybrid duplexes. Typically, hybridization duplexes are separated from unhybridized nucleic acids and the labels bound to the duplexes are then detected. Those skilled in the art will recognize that wash steps may be employed to wash away excess target DNA or probe. Standard heterogeneous assay formats are suitable for detecting the hybrids using the labels present on the primers and probes.

[0700] Two recently developed assays allow hybridization-based allele discrimination with no need for separations or washes (see Landegren U. et al., Genome Research, 8:769-776, 1998). The TaqMan assay takes advantage of the 5′ nuclease activity of Taq DNA polymerase to digest a DNA probe annealed specifically to the accumulating amplification product. TaqMan probes are labeled with a donor-acceptor dye pair that interacts via fluorescence energy transfer. Cleavage of the TaqMan probe by the advancing polymerase during amplification dissociates the donor dye from the quenching acceptor dye, greatly increasing the donor fluorescence. All reagents necessary to detect two allelic variants can be assembled at the beginning of the reaction and the results are monitored in real time (see Livak et al., Nature Genetics, 9:341-342, 1995). In an alternative homogeneous hybridization based procedure, molecular beacons are used for allele discrimination. Molecular beacons are hairpin-shaped oligonucleotide probes that report the presence of specific nucleic acids in homogeneous solutions. When they bind to their targets they undergo a conformational reorganization that restores the fluorescence of an internally quenched fluorophore (Tyagi et al., Nature Biotechnology, 16:49-53, 1998).

[0701] The polynucleotides provided herein can be used in hybridization assays for the detection of biallelic marker alleles in biological samples. These probes are characterized in that they preferably comprise between 8 and 50 nucleotides, and in that they are sufficiently complementary to a sequence comprising a biallelic marker of the present invention to hybridize thereto and preferably sufficiently specific to be able to discriminate the targeted sequence for only one nucleotide variation. The GC content in the probes of the invention usually ranges between 10 and 75%, preferably between 35 and 60%, and more preferably between 40 and 55%. The length of these probes can range from 10, 15, 20, or 30 to at least 100 nucleotides, preferably from 10 to 50, more preferably from 18 to 35 nucleotides. A particularly preferred probe is 25 nucleotides in length. Preferably the biallelic marker is within 4 nucleotides of the center of the polynucleotide probe. In particularly preferred probes the biallelic marker is at the center of said polynucleotide. Shorter probes may lack specificity for a target nucleic acid sequence and generally require cooler temperatures to form sufficiently stable hybrid complexes with the template. Longer probes are expensive to produce and can sometimes self-hybridize to form hairpin structures. Methods for the synthesis of oligonucleotide probes have been described above and can be applied to the probes of the present invention.
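The numerical preferences stated above (length, GC content and marker position) lend themselves to a simple automated check. The following Python sketch is one possible way to screen a candidate probe against those ranges; the function names and the returned dictionary keys are hypothetical, and the thresholds are simply those quoted in the preceding paragraph.

```python
def gc_percent(probe):
    """Percentage of G and C residues in the probe."""
    probe = probe.upper()
    return 100.0 * (probe.count("G") + probe.count("C")) / len(probe)

def check_probe(probe, marker_index):
    """Compare a probe with the ranges quoted in the text.

    marker_index: 0-based position of the biallelic marker in the probe.
    """
    n = len(probe)
    gc = gc_percent(probe)
    center = (n - 1) / 2.0            # geometric center of the probe
    return {
        "length_ok": 8 <= n <= 50,
        "gc_ok": 10.0 <= gc <= 75.0,
        "gc_preferred": 40.0 <= gc <= 55.0,
        "marker_near_center": abs(marker_index - center) <= 4,
        "marker_at_center": n % 2 == 1 and marker_index == n // 2,
    }
```

A 25-mer with the marker at index 12 would satisfy both positional criteria, whereas the same probe with the marker at index 2 would fail them.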

[0702] Preferably the probes of the present invention are labeled or immobilized on a solid support. Labels and solid supports are further described in II. Detection probes are generally nucleic acid sequences or uncharged nucleic acid analogs such as, for example, peptide nucleic acids, which are disclosed in International Patent Application WO 92/20702, and morpholino analogs, which are described in U.S. Pat. Nos. 5,185,444; 5,034,506 and 5,142,047. The probe may have to be rendered “non-extendable” in that additional dNTPs cannot be added to the probe. In and of themselves analogs usually are non-extendable, and nucleic acid probes can be rendered non-extendable by modifying the 3′ end of the probe such that the hydroxyl group is no longer capable of participating in elongation. For example, the 3′ end of the probe can be functionalized with the capture or detection label to thereby consume or otherwise block the hydroxyl group. Alternatively, the 3′ hydroxyl group simply can be cleaved, replaced or modified. U.S. patent application Ser. No. 07/049,061 filed Apr. 19, 1993 describes modifications which can be used to render a probe non-extendable.

[0703] The probes of the present invention are useful for a number of purposes. They can be used in Southern hybridization to genomic DNA or Northern hybridization to mRNA. The probes can also be used to detect PCR amplification products. By assaying the hybridization to an allele specific probe, one can detect the presence or absence of a biallelic marker allele in a given sample.

[0704] High-throughput parallel hybridizations in array format are specifically encompassed within “hybridization assays” and are described below.

[0705] Hybridization to addressable arrays of oligonucleotides

[0706] Hybridization assays based on oligonucleotide arrays rely on the differences in hybridization stability of short oligonucleotides to perfectly matched and mismatched target sequence variants. Efficient access to polymorphism information is obtained through a basic structure comprising high-density arrays of oligonucleotide probes attached to a solid support (the chip) at selected positions. Each DNA chip can contain thousands to millions of individual synthetic DNA probes arranged in a grid-like pattern and miniaturized to the size of a dime.

[0707] The chip technology has already been applied with success in numerous cases. For example, the screening of mutations has been undertaken in the BRCA1 gene, in S. cerevisiae mutant strains, and in the protease gene of HIV-1 virus (Hacia et al., Nature Genetics, 14(4):441-447, 1996; Shoemaker et al., Nature Genetics, 14(4):450-456, 1996; Kozal et al., Nature Medicine, 2:753-759, 1996). Chips of various formats for use in detecting biallelic polymorphisms can be produced on a customized basis by Affymetrix (GeneChip™), Hyseq (HyChip and HyGnostics), and Protogene Laboratories.

[0708] In general, these methods employ arrays of oligonucleotide probes that are complementary to target nucleic acid sequence segments from an individual, which target sequences include a polymorphic marker. EP785280 describes a tiling strategy for the detection of single nucleotide polymorphisms. Briefly, arrays may generally be “tiled” for a large number of specific polymorphisms. By “tiling” is generally meant the synthesis of a defined set of oligonucleotide probes which is made up of a sequence complementary to the target sequence of interest, as well as preselected variations of that sequence, e.g., substitution of one or more given positions with one or more members of the basis set of monomers, i.e. nucleotides. Tiling strategies are further described in PCT application No. WO 95/11995. In a particular aspect, arrays are tiled for a number of specific, identified biallelic marker sequences. In particular, the array is tiled to include a number of detection blocks, each detection block being specific for a specific biallelic marker or a set of biallelic markers. For example, a detection block may be tiled to include a number of probes which span the sequence segment that includes a specific polymorphism. To ensure probes that are complementary to each allele, the probes are synthesized in pairs differing at the biallelic marker. In addition to the probes differing at the polymorphic base, monosubstituted probes are also generally tiled within the detection block. These monosubstituted probes have bases at and up to a certain number of bases in either direction from the polymorphism, substituted with the remaining nucleotides (selected from A, T, G, C and U). Typically the probes in a tiled detection block will include substitutions of the sequence positions up to and including those that are 5 bases away from the biallelic marker. The monosubstituted probes provide internal controls for the tiled array, to distinguish actual hybridization from artefactual cross-hybridization. Upon completion of hybridization with the target sequence and washing of the array, the array is scanned to determine the position on the array to which the target sequence hybridizes. The hybridization data from the scanned array is then analyzed to identify which allele or alleles of the biallelic marker are present in the sample. Hybridization and scanning may be carried out as described in PCT application Nos. WO 92/10092 and WO 95/11995 and U.S. Pat. No. 5,424,186.
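The composition of a tiled detection block, as outlined above, can be sketched as follows. This Python example is a simplified illustration: it writes probes in the same orientation as the target segment rather than as complements, restricts the alphabet to A, C, G and T, and uses hypothetical function and key names. It generates the allele-specific pair plus monosubstituted probes at the marker and at every position up to 5 bases on either side.

```python
BASES = "ACGT"

def tile_detection_block(target, marker_index, alleles, flank=5):
    """Return a dict of probe sequences for one biallelic marker.

    target: sequence segment containing the marker (probe orientation
    is ignored for simplicity). alleles: the two bases at the marker.
    """
    probes = {}
    lo = max(0, marker_index - flank)
    hi = min(len(target) - 1, marker_index + flank)
    for allele in alleles:
        perfect = target[:marker_index] + allele + target[marker_index + 1:]
        probes[("match", allele)] = perfect
        for pos in range(lo, hi + 1):
            for base in BASES:
                if base == perfect[pos]:
                    continue          # not a substitution
                if pos == marker_index and base in alleles:
                    continue          # that is the other allele's match probe
                probes[("mono", allele, pos, base)] = (
                    perfect[:pos] + base + perfect[pos + 1:]
                )
    return probes
```

For an 11-base segment with the marker in the middle, this yields 2 perfect-match probes and 64 monosubstituted control probes.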

[0709] 5) Integrated Systems

[0710] Another technique which may be used to analyze polymorphisms includes multicomponent integrated systems, which miniaturize and compartmentalize processes such as PCR and capillary electrophoresis reactions in a single functional device. An example of such a technique is disclosed in U.S. Pat. No. 5,589,136, which describes the integration of PCR amplification and capillary electrophoresis in chips.

[0711] Integrated systems can be envisaged mainly when microfluidic systems are used. These systems comprise a pattern of microchannels designed onto a glass, silicon, quartz, or plastic wafer included on a microchip. The movements of the samples are controlled by electric, electroosmotic or hydrostatic forces applied across different areas of the microchip to create functional microscopic valves and pumps with no moving parts. Varying the voltage controls the liquid flow at intersections between the micro-machined channels and changes the liquid flow rate for pumping across different sections of the microchip.

[0712] For genotyping biallelic markers, the microfluidic system may integrate nucleic acid amplification, microsequencing, capillary electrophoresis and a detection method such as laser-induced fluorescence detection.

[0713] XI. Methods of Genetic Analysis Using the Biallelic Markers of the Present Invention

[0714] The methods available for the genetic analysis of complex traits fall into different categories (see Lander and Schork, Science, 265, 2037-2048, 1994). In general, the biallelic markers of the present invention find use in any method known in the art to demonstrate a statistically significant correlation between a genotype and a phenotype. The biallelic markers may be used in linkage analysis and in allele-sharing methods. Preferably, the biallelic markers of the present invention are used to identify genes associated with detectable traits using association studies, an approach which does not require the use of affected families and which permits the identification of genes associated with complex and sporadic traits.

[0715] The genetic analysis using the biallelic markers of the present invention may be conducted on any scale. The whole set of biallelic markers of the present invention or any subset of biallelic markers of the present invention may be used. In some embodiments, any additional set of genetic markers including a biallelic marker of the present invention may be used. As mentioned above, it should be noted that the biallelic markers of the present invention may be included in any complete or partial genetic map of the human genome. These different uses are specifically contemplated in the present invention and claims.

[0716] XI.A. Linkage Analysis

[0717] Until recently, the identification of genes linked with detectable traits has mainly relied on a statistical approach called linkage analysis. Linkage analysis involves proposing a model to explain the inheritance pattern of phenotypes and genotypes observed in a pedigree. Linkage analysis is based upon establishing a correlation between the transmission of genetic markers and that of a specific trait throughout generations within a family. In this approach, all members of a series of affected families are genotyped with a few hundred markers, typically microsatellite markers, which are distributed at an average density of one every 10 Mb. By comparing genotypes in all family members, one can attribute sets of alleles to parental haploid genomes (haplotyping or phase determination). The origin of recombined fragments is then determined in the offspring of all families. Those that co-segregate with the trait are tracked. After pooling data from all families, statistical methods are used to determine the likelihood that the marker and the trait are segregating independently in all families. As a result of the statistical analysis, one or several regions having a high probability of harboring a gene linked to the trait are selected as candidates for further analysis. The result of linkage analysis is considered significant (i.e. there is a high probability that the region contains a gene involved in a detectable trait) when the chance of independent segregation of the marker and the trait is lower than 1 in 1000 (expressed as a LOD score>3). Generally, the length of the candidate region identified as having a LOD score of greater than 3 using linkage analysis is between 2 and 20 Mb. Once a candidate region is identified as described above, analysis of recombinant individuals using additional markers allows further delineation of the candidate region. Linkage analysis studies have generally relied on the use of a maximum of 5,000 microsatellite markers, thus limiting the maximum theoretical attainable resolution of linkage analysis to about 600 kb on average.
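To make the LOD threshold concrete, the following Python sketch computes a two-point LOD score for the simplest textbook case, in which the phase is known and each meiosis can be scored as recombinant or non-recombinant. The LOD score is the base-10 logarithm of the likelihood of the data at recombination fraction θ divided by the likelihood under independent segregation (θ = 0.5), so a LOD of 3 corresponds to odds of 1000 to 1. The function name and the phase-known simplification are assumptions made for illustration; real pedigree analysis requires dedicated linkage software.

```python
import math

def lod_score(recombinants, meioses, theta):
    """Two-point LOD score for phase-known, fully informative meioses.

    LOD(theta) = log10[ theta^k * (1 - theta)^(n - k) / 0.5^n ],
    with k recombinants among n meioses.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    k, n = recombinants, meioses
    log_linked = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked
```

Under these assumptions, one recombinant among 20 meioses evaluated at θ = 0.05 gives a LOD score above 3, whereas 10 meioses without any recombinant evaluated at θ = 0.01 fall just short of that threshold.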

[0718] Linkage analysis has been successfully applied to map simple genetic traits that show clear Mendelian inheritance patterns and which have a high penetrance (i.e., the ratio between the number of trait positive carriers of allele a and the total number of a carriers in the population). About 100 pathological trait-causing genes were discovered using linkage analysis over the last 10 years. In most of these cases, the majority of affected individuals had affected relatives and the detectable trait was rare in the general population (frequencies less than 0.1%). In about 10 cases, such as Alzheimer's Disease, breast cancer, and Type II diabetes, the detectable trait was more common but the allele associated with the detectable trait was rare in the affected population. Thus, the alleles associated with these traits were not responsible for the trait in all sporadic cases.

[0719] Linkage analysis suffers from a variety of drawbacks. First, linkage analysis is limited by its reliance on the choice of a genetic model suitable for each studied trait. Furthermore, as already mentioned, the resolution attainable using linkage analysis is limited, and complementary studies are required to refine the analysis of the typical 2 Mb to 20 Mb regions initially identified through linkage analysis. In addition, linkage analysis approaches have proven difficult when applied to complex genetic traits, such as those due to the combined action of multiple genes and/or environmental factors. In such cases, too large an effort and cost are needed to recruit the adequate number of affected families required for applying linkage analysis to these situations, as recently discussed by Risch, N. and Merikangas, K. (Science, 273:1516-1517, 1996). Finally, linkage analysis cannot be applied to the study of traits for which no large informative families are available. Typically, this will be the case in any attempt to identify trait-causing alleles involved in sporadic cases, such as alleles associated with positive or negative responses to drug treatment.

[0720] XI.B. Allele-Sharing Methods

[0721] Whereas linkage analysis involves proposing a model to explain the inheritance pattern of phenotypes and genotypes in a pedigree, allele-sharing methods are not based on constructing a model, but rather on rejecting a model (see Lander and Schork, Science, 265, 2037-2048, 1994). More specifically, one tries to prove that the inheritance pattern of a chromosomal region is not consistent with random Mendelian segregation by showing that affected relatives inherit identical copies of the region more often than expected by chance. Because allele-sharing methods are nonparametric (that is, assume no model for the inheritance of the trait), they tend to be more useful for the analysis of complex traits than linkage analysis. Affected relatives should show excess allele sharing even in the presence of incomplete penetrance and polygenic inheritance. Allele-sharing methods involve studying affected relatives in a pedigree to determine how often a particular copy of a chromosomal region is shared identical-by-descent (IBD), that is, is inherited from a common ancestor within the pedigree. The frequency of IBD sharing at a locus can then be compared with random expectation. Affected sib pair analysis is a well-known special case and is the simplest form of this method.
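For affected sib pairs, the comparison with random expectation can be sketched as a goodness-of-fit test. Under random Mendelian segregation, sib pairs share 0, 1 or 2 alleles IBD with probabilities 1/4, 1/2 and 1/4. The Python example below computes the chi-square statistic against that expectation and the mean proportion of alleles shared; it assumes IBD status has been determined unambiguously for every pair, and the function names are hypothetical.

```python
def ibd_chi_square(n0, n1, n2):
    """Chi-square statistic (2 d.f.) for sib pairs sharing 0, 1, 2 alleles IBD."""
    total = n0 + n1 + n2
    expected = (0.25 * total, 0.5 * total, 0.25 * total)
    observed = (n0, n1, n2)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def mean_ibd_sharing(n0, n1, n2):
    """Proportion of alleles shared IBD; 0.5 is expected by chance."""
    total = n0 + n1 + n2
    return (n1 + 2 * n2) / (2.0 * total)
```

With 100 affected sib pairs distributed as 10, 50 and 40, the statistic is 18, well above the 5% critical value of 5.991 for two degrees of freedom, and the mean sharing is 0.65, indicating excess sharing at the locus.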

[0722] However, as allele-sharing methods analyze affected relatives, they tend to be of limited value in the genetic analysis of drug responses or in the analysis of side effects to treatments. This type of analysis is impractical in such cases due to the lack of availability of familial cases. In fact, the likelihood of having more than one individual in a family being exposed to the same drug at the same time is very low.

[0723] XI.C. Association Studies

[0724] The present invention comprises methods for identifying one or several genes among a set of candidate genes that are associated with a detectable trait using the biallelic markers of the present invention. In one embodiment the present invention comprises methods to detect an association between a biallelic marker allele or a biallelic marker haplotype and a trait. Further, the invention comprises methods to identify a trait causing allele in linkage disequilibrium with any biallelic marker allele of the present invention.

[0725] As described above, alternative approaches can be employed to perform association studies: genome-wide association studies, candidate region association studies and candidate gene association studies. In a preferred embodiment, the biallelic markers of the present invention are used to perform candidate gene association studies. The candidate gene analysis clearly provides a short-cut approach to the identification of genes and gene polymorphisms related to a particular trait when some information concerning the biology of the trait is available. Further, the biallelic markers of the present invention may be incorporated in any map of genetic markers of the human genome in order to perform genome-wide association studies. Methods to generate a high-density map of biallelic markers have been described in U.S. Provisional Patent application Ser. No. 60/082,614. The biallelic markers of the present invention may further be incorporated in any map of a specific candidate region of the genome (a specific chromosome or a specific chromosomal region for example).

[0726] As mentioned above, association studies may be conducted within the general population and are not limited to studies performed on related individuals in affected families. Linkage disequilibrium and association studies are extremely valuable as they permit the analysis of sporadic or multifactor traits. Moreover, association studies represent a powerful method for fine-scale mapping, enabling much finer mapping of trait causing alleles than linkage studies. Studies based on pedigrees often only narrow the location of the trait causing allele. Association studies and Linkage Disequilibrium mapping methods using the biallelic markers of the present invention can therefore be used to refine the location of a trait causing allele in a candidate region identified by Linkage Analysis or by Allele-Sharing methods. Moreover, once a chromosome segment of interest has been identified, the presence of a candidate gene, such as a candidate gene of the present invention, in the region of interest can provide a shortcut to the identification of the trait causing allele. Biallelic markers of the present invention can be used to demonstrate that a candidate gene is associated with a trait. Such uses are specifically contemplated in the present invention and claims.

[0727] 1) Case-Control Populations (Inclusion Criteria)

[0728] Association studies do not concern familial inheritance and do not involve the analysis of large family pedigrees but compare the prevalence of a particular genetic marker, or a set of markers, in case-control populations. They are case-control studies based on comparison of unrelated case (affected or trait positive) individuals and unrelated control (random or unaffected or trait negative) individuals. The control group is composed of individuals chosen randomly or of unaffected (trait negative) individuals; preferably, the control group is composed of unaffected or trait negative individuals. Further, the control group is preferably both ethnically- and age-matched to the case population. In the following, “trait positive population”, “case population” and “affected population” are used interchangeably.
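The comparison of marker prevalence between cases and controls can be illustrated with a standard chi-square test on a 2×2 table of allele counts. The Python sketch below is a generic illustration of that test, not a description of the statistical procedure used elsewhere in this disclosure; the function names and the example counts are hypothetical.

```python
def allelic_chi_square(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Pearson chi-square (1 d.f.) for a 2x2 table of allele counts."""
    a, b, c, d = case_a1, case_a2, ctrl_a1, ctrl_a2
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / denom

def odds_ratio(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Odds of carrying allele 1 in cases relative to controls."""
    return (case_a1 * ctrl_a2) / (case_a2 * ctrl_a1)
```

For instance, with 100 cases carrying 120 copies of allele 1 and 80 copies of allele 2, and 100 controls carrying 90 and 110 copies respectively, the statistic is about 9.02, which exceeds the 5% critical value of 3.841 for one degree of freedom, and the odds ratio is about 1.83.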

[0729] An important step in the dissection of complex traits using association studies is the choice of case-control populations (see Lander and Schork, Science, 265, 2037-2048, 1994). Narrowing the definition of the disease and restricting the patient population to extreme phenotypes allows one to work with a trait that is more nearly Mendelian in its inheritance pattern and more likely to be homogeneous (patients suffer from the disease for the same genetic reasons). Therefore, a major step in the choice of case-control populations is the clinical definition of a given trait or phenotype. Four criteria are often useful: clinical phenotype, age at onset, family history and severity. In order to perform efficient and significant association studies, such as those described herein, the trait under study should preferably follow a bimodal distribution in the population under study, presenting two clear non-overlapping phenotypes (trait positive and trait negative). Nevertheless, even in the absence of such bimodal distribution (as may in fact be the case for more complex genetic traits), any genetic trait may still be analyzed by the association method proposed here by carefully selecting the individuals to be included in the trait positive and trait negative phenotypic groups. The selection procedure involves selecting individuals at opposite ends of the non-bimodal phenotype spectra of the trait under study, so as to include in these trait positive and trait negative populations individuals which clearly represent extreme, preferably non-overlapping phenotypes. This is particularly useful for continuous or quantitative traits (such as blood pressure for example). Selection of individuals at extreme ends of the trait distribution increases the ability to analyze these complex traits. The definition of the inclusion criteria for the case-control populations is an important aspect of association studies. The selection of those drastically different but relatively uniform phenotypes enables efficient comparisons in association studies and the possible detection of marked differences at the genetic level, provided that the sample sizes of the populations under study are sufficiently large.

[0730] Preferably, case-control populations to be included in association studies such as those proposed in the present invention consist of phenotypically homogeneous populations of individuals each representing 100% of the corresponding phenotype if the trait distribution is bimodal. If the trait distribution is non-bimodal, trait positive and trait negative populations consist of phenotypically uniform populations of individuals each representing between 1 and 98%, preferably between 1 and 80%, more preferably between 1 and 50%, still more preferably between 1 and 30%, and most preferably between 1 and 20% of the total population under study, and selected among individuals exhibiting non-overlapping phenotypes. In some embodiments, the trait positive and trait negative groups consist of individuals exhibiting the extreme phenotypes within the studied population. The clearer the difference between the two trait phenotypes, the greater the probability of detecting an association with biallelic markers.
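The selection of extreme phenotypes for a non-bimodal (quantitative) trait can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `select_extremes`, the dictionary layout (individual identifier mapped to a trait value) and the default fraction of 20% are hypothetical choices, not part of the described method.

```python
def select_extremes(trait_values, fraction=0.2):
    """Return (low_tail_ids, high_tail_ids) taken from the two ends of the trait distribution.

    trait_values: dict mapping an individual identifier to a quantitative trait value
                  (hypothetical layout chosen for this sketch).
    fraction:     share of the studied population placed in each group (assumed default 20%).
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5] so the two groups cannot overlap")
    # Identifiers ordered from the lowest to the highest trait value.
    ranked = sorted(trait_values, key=trait_values.get)
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]


if __name__ == "__main__":
    blood_pressure = {"s1": 110, "s2": 150, "s3": 95, "s4": 180, "s5": 125,
                      "s6": 100, "s7": 160, "s8": 130, "s9": 140, "s10": 120}
    low_tail, high_tail = select_extremes(blood_pressure, fraction=0.2)
    print("low tail:", low_tail)
    print("high tail:", high_tail)
```

Which tail is treated as trait positive depends on the trait under study; the sketch only separates the two non-overlapping groups.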

[0731] In preferred embodiments, a first group of between 50 and 300 trait positive individuals, preferably about 100 individuals, is recruited according to their phenotypes. A similar number of trait negative individuals is included in such studies.

[0732] In the present invention, typical examples of inclusion criteria include a diagnosis of cancer or prostate cancer or the evaluation of the response to anti-cancer or anti-prostate cancer agents or side effects to treatment with anti-cancer or anti-prostate cancer agents.

[0733] Suitable examples of association studies using biallelic markers, including the biallelic markers of the present invention, are studies involving the following populations:

[0734] a case population suffering from a form of cancer and a healthy unaffected control population, or

[0735] a case population suffering from a form of prostate cancer and a healthy unaffected control population, or

[0736] a case population treated with anti-cancer agents suffering from side-effects resulting from the treatment and a control population treated with the same agents showing no side-effects, or

[0737] a case population treated with anti-prostate cancer agents suffering from side-effects resulting from the treatment and a control population treated with the same agents showing no side-effects, or

[0738] a case population treated with anti-cancer agents showing a beneficial response and a control population treated with the same agents showing no beneficial response, or

[0739] a case population treated with anti-prostate cancer agents showing a beneficial response and a control population treated with the same agents showing no beneficial response.

[0740] 2) Determining the Frequency of an Allele in Case-Control Populations

[0741] Allelic frequencies of the biallelic markers in each of the populations can be determined using one of the methods described above in Section X under the heading “Methods for genotyping an individual for biallelic markers”, or any genotyping procedure suitable for this intended purpose. The frequency of a biallelic marker allele in a population can be determined by genotyping pooled samples or individual samples. One way to reduce the number of genotypings required is to use pooled samples. A major obstacle in using pooled samples is the accuracy and reproducibility with which DNA concentrations can be determined when setting up the pools. Genotyping individual samples provides higher sensitivity, reproducibility and accuracy and is the preferred method used in the present invention. Preferably, each individual is genotyped separately and simple gene counting is applied to determine the frequency of an allele of a biallelic marker or of a genotype in a given population.
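Simple gene counting on individually genotyped samples amounts to tallying the two alleles carried by each individual. The following is a minimal sketch, assuming that each genotype is encoded as a two-character string such as "AG"; that encoding and the function name `allele_frequencies` are illustrative assumptions.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Gene counting: return {allele: frequency} for one biallelic marker.

    genotypes: iterable of two-character strings such as "AG", one per
               individual (hypothetical encoding used for this sketch).
    """
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)  # each diploid individual contributes two alleles
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


if __name__ == "__main__":
    cases = ["AA", "AA", "AG", "GG"]
    print(allele_frequencies(cases))
```

For the four genotypes in the usage example, allele A is counted five times out of eight chromosomes and allele G three times.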

[0742] 3) Determining the Frequency of a Haplotype in Case-Control Populations

[0743] The gametic phase of haplotypes is usually unknown when diploid individuals are heterozygous at more than one locus. Different strategies for inferring haplotypes can be used to partially overcome this difficulty (see Excoffier L. and Slatkin M., Mol. Biol. Evol., 12(5): 921-927, 1995). One possibility is that the multiple-site heterozygous diploids can be eliminated from the analysis, keeping only the homozygotes and the single-site heterozygote individuals, but this approach might lead to a possible bias in the sample composition and the underestimation of low-frequency haplotypes. Another possibility is that single chromosomes can be studied independently, for example, by asymmetric PCR amplification (see Newton et al., Nucleic Acids Res., 17:2503-2516, 1989; Wu et al., Proc. Natl. Acad. Sci. USA, 86:2757, 1989) or by isolation of a single chromosome by limit dilution followed by PCR amplification (see Ruano et al., Proc. Natl. Acad. Sci. USA, 87:6296-6300, 1990). Further, multiple haplotypes can sometimes be inferred using genealogical information in families (Perlin et al., Am. J. Hum. Genet., 55:777-787, 1994). A sample may also be haplotyped for sufficiently close biallelic markers by double PCR amplification of specific alleles (Sarkar, G. and Sommer S. S., Biotechniques, 1991). These approaches are not entirely satisfying either because of their technical complexity, the additional cost they entail, their lack of generalization at a large scale, or the possible biases they introduce. To overcome these difficulties, an algorithm based on Hardy-Weinberg equilibrium (random mating) to infer the phase of PCR-amplified DNA genotypes introduced by Clark A. G. (Mol. Biol. Evol., 7:111-122, 1990) can be used. Briefly, the principle is to start filling a preliminary list of haplotypes present in the sample by examining unambiguous individuals, that is, the complete homozygotes and the single-site heterozygotes. Then other individuals in the same sample are screened for the possible occurrence of previously recognized haplotypes. For each positive identification, the complementary haplotype is added to the list of recognized haplotypes, until the phase information for all individuals is either resolved or identified as unresolved. This method assigns a single haplotype to each multiheterozygous individual, whereas several haplotypes are possible when there is more than one heterozygous site. Any other method known in the art to determine the frequency of a haplotype in a population can be used. Preferably, an expectation-maximization (EM) algorithm (Dempster et al., J. R. Stat. Soc., 39B:1-38, 1977) leading to maximum-likelihood estimates of haplotype frequencies under the assumption of Hardy-Weinberg proportions is used (see Excoffier L. and Slatkin M., Mol. Biol. Evol., 12(5): 921-927, 1995). The EM algorithm is used to estimate haplotype frequencies in the case when only genotype data from unrelated individuals are available. The EM algorithm is a generalized iterative maximum-likelihood approach to estimation that is useful when data are ambiguous and/or incomplete. The EM algorithm is used to resolve heterozygotes into haplotypes. Haplotype estimations are further described below under the heading “Statistical methods”.

[0744] 4) Genetic Analysis based on Linkage Disequilibrium

[0745] Linkage disequilibrium is the non-random association of alleles at two or more loci and represents a powerful tool for genetic mapping of complex traits (see Jorde L. B., Am. J. Hum. Genet., 56:11-14, 1995). Biallelic markers, because they are densely spaced in the human genome and can be genotyped in large numbers, are particularly useful in genetic analysis based on linkage disequilibrium.

[0746] When a disease mutation is first introduced into a population (by a new mutation or the immigration of a mutation carrier), it necessarily resides on a single chromosome and thus on a single “background” or “ancestral” haplotype of linked markers. Consequently, there is complete disequilibrium between these markers and the disease mutation: one finds the disease mutation only in the presence of a specific set of marker alleles. Through subsequent generations recombinations occur between the disease mutation and these marker polymorphisms, and the disequilibrium gradually dissipates. The pace of this dissipation is a function of the recombination frequency, so the markers closest to the disease gene will manifest higher levels of disequilibrium than those that are further away. When not broken up by recombination, “ancestral” haplotypes and linkage disequilibrium between marker alleles at different loci can be tracked not only through pedigrees but also through populations.

[0747] The pattern or curve of disequilibrium between disease and marker loci will exhibit a single maximum that occurs at the disease locus. Consequently, the amount of linkage disequilibrium between a disease allele and closely linked genetic markers may yield valuable information regarding the location of the disease gene. For fine-scale mapping of a disease locus, it is useful to have some knowledge of the patterns of linkage disequilibrium that exist between markers in the studied region. As mentioned above, the mapping resolution achieved through the analysis of linkage disequilibrium is much higher than that of linkage studies. The high density of biallelic markers combined with linkage disequilibrium analysis provides powerful tools for fine-scale mapping. Different methods to calculate linkage disequilibrium are described below under the heading “Statistical Methods”. Moreover, association studies as a method of mapping genetic traits rely on the phenomenon of linkage disequilibrium.

[0748] 5) Association Studies

[0749] As mentioned above, the occurrence of pairs of specific alleles at different loci on the same chromosome is not random, and the deviation from random is called linkage disequilibrium. If a specific allele in a given gene is directly involved in causing a particular trait, its frequency will be statistically increased in an affected (trait positive) population when compared to the frequency in a trait negative population or in a random control population. As a consequence of the existence of linkage disequilibrium, the frequency of all other alleles present in the haplotype carrying the trait-causing allele will also be increased in trait positive individuals compared to trait negative individuals or random controls. Therefore, association between the trait and any allele (specifically a biallelic marker allele) in linkage disequilibrium with the trait-causing allele will suffice to suggest the presence of a trait-related gene in that particular allele's region. Association studies focus on population frequencies. Case-control populations can be genotyped for biallelic markers to identify associations that narrowly locate a trait-causing allele. Moreover, any marker in linkage disequilibrium with one given marker associated with a trait will be associated with the trait. Linkage disequilibrium allows the relative frequencies in case-control populations of a limited number of genetic polymorphisms (specifically biallelic markers) to be analyzed as an alternative to screening all possible functional polymorphisms in order to find trait-causing alleles. Association studies compare the frequency of marker alleles in unrelated case-control populations, and represent powerful tools for the dissection of complex traits.

[0750] Association Analysis

[0751] The general strategy to perform association studies using biallelic markers derived from a candidate gene is to scan two groups of individuals (case-control populations) in order to measure and statistically compare the allele frequencies of the biallelic markers of the present invention in both groups.

[0752] If a statistically significant association with a trait is identified for one or more of the analyzed biallelic markers, one can assume that either the associated allele is directly responsible for causing the trait (the associated allele is the trait causing allele), or, more likely, the associated allele is in linkage disequilibrium with the trait causing allele. The specific characteristics of the associated allele with respect to the candidate gene function usually give further insight into the relationship between the associated allele and the trait (causal or in linkage disequilibrium). If the evidence indicates that the associated allele within the candidate gene is most probably not the trait causing allele but is in linkage disequilibrium with the real trait causing allele, then the trait causing allele can be found by sequencing the vicinity of the associated marker.

[0753] Association studies are usually run in two successive steps. In a first phase, the frequencies of a reduced number of biallelic markers from one or several candidate genes are determined in the trait positive and trait negative populations. In a second phase of the analysis, the identity of the candidate gene and the position of the genetic loci responsible for the given trait are further refined using a higher density of markers from the relevant gene. However, if the candidate gene under study is relatively small in length, as is the case for many of the candidate genes analyzed in the present invention, a single phase may be sufficient to establish significant associations.

[0754] Haplotype Analysis

[0755] As described above, when a chromosome carrying a disease allele first appears in a population as a result of either mutation or migration, the mutant allele necessarily resides on a chromosome having a unique set of linked markers: the ancestral haplotype. This haplotype can be tracked through populations and its statistical association with a given trait can be analyzed. The statistical power of association studies is increased by complementing single point (allelic) association studies with multi-point association studies, also called haplotype studies. Thus, a haplotype association study allows one to define the frequency and the type of the ancestral carrier haplotype. A haplotype analysis is important in that it increases the statistical significance of an analysis involving individual markers. Indeed, performing an association study with a set of biallelic markers increases the value of the results obtained through the study, allowing false positive and/or negative data that may result from the single marker studies to be eliminated.

[0756] In a first stage of a haplotype frequency analysis, the frequency of the possible haplotypes based on various combinations of the identified biallelic markers of the invention is determined. The haplotype frequency is then compared for distinct populations of trait positive and control individuals. The number of trait positive individuals which should be subjected to this analysis to obtain statistically significant results usually ranges between 30 and 300, with a preferred number of individuals ranging between 50 and 150. The same considerations apply to the number of random control or unaffected individuals used in the study. The results of this first analysis provide haplotype frequencies in case-control populations, the relative risk for an individual carrying a given haplotype of being affected with the given trait under study and the estimated p value for each evaluated haplotype.

[0757] Interaction Analysis

[0758] The biallelic markers of the present invention may also be used to identify patterns of biallelic markers associated with detectable traits resulting from polygenic interactions. The analysis of genetic interaction between alleles at unlinked loci requires individual genotyping using the techniques described herein. The analysis of allelic interaction among a selected set of biallelic markers with an appropriate level of statistical significance can be considered as a haplotype analysis, similar to those described in further detail within the present invention. Preferably, genotyping is performed using the microsequencing technique.

[0759] Methods to test for association between a trait and a biallelic marker allele or a haplotype of biallelic marker alleles are described below.

[0760] XI.D. Statistical Methods

[0761] In general, any method known in the art to test whether a trait and a genotype show a statistically significant correlation can be used.

[0762] Methods to estimate haplotype frequencies in a population

[0763] As described above, when genotypes are scored, it is often not possible to distinguish heterozygotes so that haplotype frequencies cannot be easily inferred. When the gametic phase is not known, haplotype frequencies can be estimated from the multilocus genotypic data. Any method known to a person skilled in the art can be used to estimate haplotype frequencies (see Lange K., Mathematical and Statistical Methods for Genetic Analysis, Springer, N.Y., 1997; Weir, B. S., Genetic Data Analysis II: Methods for Discrete Population Genetic Data, Sinauer Assoc., Inc., Sunderland, Mass., USA, 1996). Preferably, maximum-likelihood haplotype frequencies are computed using an Expectation-Maximization (EM) algorithm (see Dempster et al., J. R. Stat. Soc., 39B:1-38, 1977; Excoffier L. and Slatkin M., Mol. Biol. Evol., 12(5): 921-927, 1995). This procedure is an iterative process aiming at obtaining maximum-likelihood estimates of haplotype frequencies from multi-locus genotype data when the gametic phase is unknown. Haplotype estimations are usually performed by applying the EM algorithm using for example the EM-HAPLO program (Hawley M. E. et al., Am. J. Phys. Anthropol., 18:104, 1994) or the Arlequin program (Schneider et al., Arlequin: a software for population genetics data analysis, University of Geneva, 1997). The EM algorithm is a generalized iterative maximum likelihood approach to estimation and is briefly described below.

[0764] In the following part of this text, phenotypes will refer to multi-locus genotypes with unknown phase. Genotypes will refer to known-phase multi-locus genotypes.

[0765] Suppose a sample of N unrelated individuals is typed for K markers. The data observed are the unknown-phase K-locus phenotypes, which can be categorized into F different phenotypes. Suppose that we have H underlying possible haplotypes (in the case of K biallelic markers, $H = 2^{K}$). For phenotype j, suppose that $c_j$ genotypes are possible. We thus have the following equation:

$P_{j} = \sum_{i=1}^{c_{j}} \mathrm{pr}(\mathrm{genotype}_{i}) = \sum_{i=1}^{c_{j}} \mathrm{pr}(h_{k}, h_{l}) \qquad \text{(Equation 1)}$

[0766] where $P_j$ is the probability of phenotype j, and $h_k$ and $h_l$ are the two haplotypes constituting genotype i. Under Hardy-Weinberg equilibrium, $\mathrm{pr}(h_k, h_l)$ becomes:

$\mathrm{pr}(h_{k}, h_{l}) = \mathrm{pr}(h_{k})^{2} \text{ if } h_{k} = h_{l}, \qquad \mathrm{pr}(h_{k}, h_{l}) = 2\,\mathrm{pr}(h_{k})\cdot\mathrm{pr}(h_{l}) \text{ if } h_{k} \neq h_{l}.$

[0767] Equation 2

[0768] The successive steps of the E-M algorithm can be described as follows: starting with initial values of the haplotype frequencies, noted $p_{1}^{(0)}, p_{2}^{(0)}, \ldots, p_{T}^{(0)}$, these initial values serve to estimate the genotype frequencies (Expectation step) and then to estimate another set of haplotype frequencies (Maximization step), $p_{1}^{(1)}, p_{2}^{(1)}, \ldots, p_{T}^{(1)}$. These two steps are iterated until the changes in the sets of haplotype frequencies are very small.

[0769] A stop criterion can be that the maximum difference between haplotype frequencies between two iterations is less than 10⁻⁷. This value can be adjusted according to the desired precision of estimations.

[0770] In detail, at a given iteration s, the Expectation step consists in calculating the genotype frequencies by the following equation:

[0771] $\mathrm{pr}(\mathrm{genotype}_{i})^{(s)} = \mathrm{pr}(\mathrm{phenotype}_{j}) \cdot \mathrm{pr}(\mathrm{genotype}_{i} \mid \mathrm{phenotype}_{j})^{(s)} = \frac{n_{j}}{N} \cdot \frac{\mathrm{pr}(h_{k}, h_{l})^{(s)}}{P_{j}^{(s)}} \qquad \text{(Equation 3)}$

[0772] where genotype i occurs in phenotype j, where $n_j$ is the number of individuals exhibiting phenotype j, and where $h_k$ and $h_l$ constitute genotype i. Each probability is derived according to Equations 1 and 2 above.

[0773] Then the Maximization step simply estimates another set of haplotype frequencies given the genotype frequencies. This approach is also known as the gene-counting method (Smith, Ann. Hum. Genet., 21:254-276, 1957).

$p_{t}^{(s+1)} = \frac{1}{2} \sum_{j=1}^{F} \sum_{i=1}^{c_{j}} \delta_{it} \cdot \mathrm{pr}(\mathrm{genotype}_{i})^{(s)} \qquad \text{(Equation 4)}$

[0774] where $\delta_{it}$ is an indicator variable which counts the number of times haplotype t occurs in genotype i. It takes the values 0, 1 or 2.

[0775] To ensure that the estimates finally obtained are the maximum-likelihood estimates, several starting values are required. The estimates obtained are compared and, if they differ, the estimates leading to the best likelihood are kept. The term “haplotype determination method” is used to refer to all methods for determining haplotypes known in the art including expectation-maximization algorithms.
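The Expectation and Maximization steps of Equations 1 to 4 can be sketched as follows for biallelic markers. This is an illustrative sketch only, not the EM-HAPLO or Arlequin implementation: the encoding of each unphased genotype as a tuple of allele counts (0, 1 or 2 copies of one allele per marker), the function names, the uniform starting values and the single starting point are all assumptions made for the example. In practice several starting points would be compared, as stated above.

```python
from collections import Counter
from itertools import product


def compatible_pairs(phenotype):
    """List the unordered haplotype pairs (h_k, h_l) consistent with an unphased genotype.

    phenotype: tuple giving, per marker, the number of copies (0, 1 or 2) of
               allele "1" (hypothetical encoding used for this sketch).
    """
    het = [i for i, g in enumerate(phenotype) if g == 1]
    base = [g // 2 for g in phenotype]  # homozygous sites are fixed: 0 -> 0, 2 -> 1
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(phenotypes, tol=1e-7, max_iter=1000):
    """Estimate haplotype frequencies by EM under Hardy-Weinberg proportions."""
    counts = Counter(phenotypes)            # n_j for each distinct phenotype j
    n_total = sum(counts.values())          # N
    pairs = {ph: compatible_pairs(ph) for ph in counts}
    haps = sorted({h for ps in pairs.values() for pair in ps for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}   # uniform starting values (assumption)
    for _ in range(max_iter):
        new = dict.fromkeys(haps, 0.0)
        for ph, n_j in counts.items():
            # Equation 2: pr(h_k, h_l) under Hardy-Weinberg proportions.
            pr = [freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
                  for a, b in pairs[ph]]
            p_j = sum(pr)                       # Equation 1
            for (a, b), p in zip(pairs[ph], pr):
                w = (n_j / n_total) * p / p_j   # Equation 3 (Expectation step)
                new[a] += 0.5 * w               # Equation 4 (Maximization step);
                new[b] += 0.5 * w               # a == b adds w, i.e. delta = 2
        change = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if change < tol:                        # stop criterion on the largest change
            break
    return freq


if __name__ == "__main__":
    sample = [(2, 2)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 2
    for hap, p in em_haplotype_frequencies(sample).items():
        print(hap, round(p, 4))
```

In the usage example the two double heterozygotes are ambiguous; because haplotypes (1, 1) and (0, 0) are already well supported by the homozygotes, the iterations assign nearly all of the ambiguous weight to that pair.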

[0776] Methods to Calculate Linkage Disequilibrium Between Markers

[0777] A number of methods can be used to calculate linkage disequilibrium between any two genetic positions. In practice, linkage disequilibrium is measured by applying a statistical association test to haplotype data taken from a population. Linkage disequilibrium between any pair of biallelic markers comprising at least one of the biallelic markers of the present invention ($M_i$, $M_j$) can be calculated for every allele combination ($M_{i1},M_{j1}$; $M_{i1},M_{j2}$; $M_{i2},M_{j1}$ and $M_{i2},M_{j2}$), according to the Piazza formula:

$\Delta_{M_{ik},M_{jl}} = \sqrt{\theta_{4}} - \sqrt{(\theta_{4} + \theta_{3})(\theta_{4} + \theta_{2})}$, where:

[0778] $\theta_4$ = − − = frequency of genotypes not having allele k at $M_i$ and not having allele l at $M_j$

[0779] $\theta_3$ = − + = frequency of genotypes not having allele k at $M_i$ and having allele l at $M_j$

[0780] $\theta_2$ = + − = frequency of genotypes having allele k at $M_i$ and not having allele l at $M_j$
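The Piazza formula can be evaluated directly from the three genotype-class frequencies. The sketch below is illustrative: the function names and the encoding of each individual as a pair of two-character genotype strings (one per marker) are assumptions made for the example.

```python
from math import sqrt


def piazza_delta(theta4, theta3, theta2):
    """Piazza formula: sqrt(theta4) - sqrt((theta4 + theta3) * (theta4 + theta2))."""
    return sqrt(theta4) - sqrt((theta4 + theta3) * (theta4 + theta2))


def piazza_from_genotypes(genotypes, k, l):
    """Compute the Piazza delta for allele k at marker Mi and allele l at marker Mj.

    genotypes: list of (genotype_at_Mi, genotype_at_Mj) pairs such as ("AG", "CT")
               (hypothetical encoding used for this sketch).
    """
    n = len(genotypes)
    # theta4: allele k absent at Mi and allele l absent at Mj
    theta4 = sum(k not in gi and l not in gj for gi, gj in genotypes) / n
    # theta3: allele k absent at Mi, allele l present at Mj
    theta3 = sum(k not in gi and l in gj for gi, gj in genotypes) / n
    # theta2: allele k present at Mi, allele l absent at Mj
    theta2 = sum(k in gi and l not in gj for gi, gj in genotypes) / n
    return piazza_delta(theta4, theta3, theta2)


if __name__ == "__main__":
    sample = [("AA", "CC"), ("AG", "CT"), ("GG", "TT"), ("AG", "CT")]
    print(piazza_from_genotypes(sample, k="A", l="C"))
```

When the three classes are equally frequent (0.25 each) the value is zero, as expected for independent markers; when only the first class is populated at 0.25 the value is 0.25.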

[0781] Linkage disequilibrium (LD) between pairs of biallelic markers ($M_i$, $M_j$) can also be calculated for every allele combination ($M_{i1},M_{j1}$; $M_{i1},M_{j2}$; $M_{i2},M_{j1}$ and $M_{i2},M_{j2}$), according to the maximum-likelihood estimate (MLE) for delta (the composite linkage disequilibrium coefficient), as described by Weir (B. S. Weir, Genetic Data Analysis, Sinauer Ass. Eds, 1996). This formula allows linkage disequilibrium between alleles to be estimated when only genotype, and not haplotype, data are available. This LD composite test makes no assumption for random mating in the sampled population, and thus seems to be more appropriate than other LD tests for genotypic data.

[0782] Another means of calculating the linkage disequilibrium between markers is as follows. For a couple of biallelic markers, $M_i$ ($a_i/b_i$) and $M_j$ ($a_j/b_j$), fitting the Hardy-Weinberg equilibrium, one can estimate the four possible haplotype frequencies in a given population according to the approach described above.

[0783] The estimation of gametic disequilibrium between $a_i$ and $a_j$ is simply:

$D_{a_i a_j} = \mathrm{pr}(\mathrm{haplotype}(a_i, a_j)) - \mathrm{pr}(a_i) \cdot \mathrm{pr}(a_j).$

[0784] where $\mathrm{pr}(a_i)$ is the probability of allele $a_i$, $\mathrm{pr}(a_j)$ is the probability of allele $a_j$, and $\mathrm{pr}(\mathrm{haplotype}(a_i, a_j))$ is estimated as in Equation 3 above.

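[0785] For a couple of biallelic markers, only one measure of disequilibrium is necessary to describe the association between $M_i$ and $M_j$.

[0786] Then a normalized value of the above is calculated as follows:

$D'_{a_i a_j} = D_{a_i a_j} / \max(\mathrm{pr}(a_i) \cdot \mathrm{pr}(a_j),\ \mathrm{pr}(b_i) \cdot \mathrm{pr}(b_j))$ with $D_{a_i a_j} < 0$

$D'_{a_i a_j} = D_{a_i a_j} / \max(\mathrm{pr}(b_i) \cdot \mathrm{pr}(a_j),\ \mathrm{pr}(a_i) \cdot \mathrm{pr}(b_j))$ with $D_{a_i a_j} > 0$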
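The gametic disequilibrium and a normalized value can be computed from an estimated haplotype frequency and the two allele frequencies. The sketch below is illustrative and makes one substitution that should be noted plainly: for the denominator it uses the conventional Lewontin bound (the smaller of the two products of allele frequencies), which guarantees a value between -1 and 1. That choice, like the function names, is an assumption of this example rather than a statement of the method described here.

```python
def gametic_disequilibrium(p_hap, p_ai, p_aj):
    """D = pr(haplotype(ai, aj)) - pr(ai) * pr(aj)."""
    return p_hap - p_ai * p_aj


def normalized_disequilibrium(p_hap, p_ai, p_aj):
    """Normalized D using the conventional Lewontin bound (assumption of this sketch)."""
    d = gametic_disequilibrium(p_hap, p_ai, p_aj)
    p_bi, p_bj = 1.0 - p_ai, 1.0 - p_aj   # biallelic markers: the other allele
    if d < 0:
        bound = min(p_ai * p_aj, p_bi * p_bj)
    else:
        bound = min(p_ai * p_bj, p_bi * p_aj)
    return d / bound if bound > 0 else 0.0


if __name__ == "__main__":
    # Alleles ai and aj always travel together, each at frequency 0.5.
    print(gametic_disequilibrium(0.5, 0.5, 0.5))
    print(normalized_disequilibrium(0.5, 0.5, 0.5))
```

With both alleles at frequency 0.5 and always found on the same haplotype, D is 0.25 and the normalized value is 1; with a haplotype frequency of 0.25 both values are 0.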

[0787] The skilled person will readily appreciate that other LD calculation methods can be used without undue experimentation.

[0788] Linkage disequilibrium among a set of biallelic markers having an adequate heterozygosity rate can be determined by genotyping between 50 and 1000 unrelated individuals, preferably between 75 and 200, more preferably around 100.

[0789] Testing for Association

[0790] The statistical significance of a correlation between a phenotype and a genotype, in this case an allele at a biallelic marker or a haplotype made up of such alleles, can be determined by any statistical test known in the art and with any accepted threshold of statistical significance being required. The application of particular methods and thresholds of significance is well within the skill of the ordinary practitioner of the art.

[0791] Testing for association is performed by determining the frequency of a biallelic marker allele in case and control populations and comparing these frequencies with a statistical test to determine if there is a statistically significant difference in frequency which would indicate a correlation between the trait and the biallelic marker allele under study. Similarly, a haplotype analysis is performed by estimating the frequencies of all possible haplotypes for a given set of biallelic markers in case and control populations, and comparing these frequencies with a statistical test to determine if there is a statistically significant correlation between the haplotype and the phenotype (trait) under study. Any statistical tool useful to test for a statistically significant association between a genotype and a phenotype can be used. Preferably the statistical test employed is a chi square test with one degree of freedom. A P-value is calculated (the P-value is the probability that a statistic as large as or larger than the observed one would occur by chance).
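A chi square test with one degree of freedom on allele counts can be sketched as follows. The sketch assumes a 2x2 table of allele counts (two alleles, cases versus controls), applies no continuity correction, and uses hypothetical function and parameter names; the counts in the usage example are invented for illustration.

```python
from math import erfc, sqrt


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Return (chi-square statistic, p-value) for a 2x2 table of allele counts.

    case_a, case_b:       counts of the two alleles among case chromosomes.
    control_a, control_b: counts of the two alleles among control chromosomes.
    """
    n = case_a + case_b + control_a + control_b
    row_case, row_ctrl = case_a + case_b, control_a + control_b
    col_a, col_b = case_a + control_a, case_b + control_b
    if 0 in (row_case, row_ctrl, col_a, col_b):
        raise ValueError("every row and column total must be non-zero")
    chi2 = (n * (case_a * control_b - case_b * control_a) ** 2
            / (row_case * row_ctrl * col_a * col_b))
    # Survival function of the chi-square distribution with one degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = allelic_chi_square(60, 40, 40, 60)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

For the invented counts above (60 and 40 in cases, 40 and 60 in controls) the statistic is 8.0, which corresponds to a P-value of about 0.0047.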

[0792] Statistical Significance

[0793] In preferred embodiments, for significance for diagnosis purposes, either as a positive basis for further diagnostic tests or as a preliminary starting point for early preventive therapy, the p value related to a biallelic marker association is preferably about 1×10⁻² or less, more preferably about 1×10⁻⁴ or less, for a single biallelic marker analysis, and about 1×10⁻³ or less, still more preferably 1×10⁻⁶ or less and most preferably about 1×10⁻⁸ or less, for a haplotype analysis involving several markers. These values are believed to be applicable to any association studies involving single or multiple marker combinations.

[0794] The skilled person can use the range of values set forth above as a starting point in order to carry out association studies with biallelic markers of the present invention. In doing so, significant associations between the biallelic markers of the present invention and cancer and prostate cancer can be revealed and used for diagnosis and drug screening purposes.

[0795] Using the method described above and evaluating the associations for single marker alleles or for haplotypes permits an estimation of the risk a corresponding carrier has to develop a given trait, and particularly in the context of the present invention, a disease, preferably cancer, more preferably prostate cancer. Significance thresholds of relative risks are to be adapted to the reference sample population used.

[0796] In this regard, among all the possible marker combinations or haplotypes which are evaluated to determine the significance of their association with a given trait, for example a form of cancer or prostate cancer, a response to treatment with anti-cancer or anti-prostate cancer agents or side effects related to treatment with anti-cancer or anti-prostate cancer agents, it is believed that those displaying a coefficient of relative risk above 1, preferably about 5 or more, more preferably about 7 or more, are indicative of a “significant risk” for the individuals carrying the identified haplotype to develop the given trait. It is difficult to evaluate accurately quantified boundaries for the so-called “significant risk”. Indeed, and as has been demonstrated previously, several traits observed in a given population are multifactorial in that they are not only the result of a single genetic predisposition but also of other factors such as environmental factors or the presence of further, apparently unrelated, haplotype associations. Thus, the evaluation of a significant risk must take these parameters into consideration in order to, in a certain manner, weigh the potential importance of external parameters in the development of a given trait. Without wishing to be bound to any invariable model or theory based on the above statistical analyses, the inventors believe that a “significant risk” to develop a given trait is evaluated differently depending on the trait under consideration.

[0797] It will of course be understood by practitioners skilled in the treatment or diagnosis of cancer and prostate cancer that the present invention does not intend to provide an absolute identification of individuals who could be at risk of developing a particular form of cancer or who will or will not respond or exhibit side effects to treatment with anti-cancer or anti-prostate cancer agents but rather to indicate a certain degree or likelihood of developing a disease or of observing in a given individual a response or a side effect to treatment with a particular agent or set of agents.

[0798] However, this information is extremely valuable as it can, in certain circumstances, be used to initiate preventive treatments or to allow an individual carrying a significant haplotype to foresee warning signs such as minor symptoms. In the case of cancer, the knowledge of a potential predisposition, even if this predisposition is not absolute, might contribute in a very significant manner to treatment, or allow for suggestions of changes in diet or the reduction of risky behaviors, e.g. smoking. Similarly, a diagnosed predisposition to a potential side effect could immediately direct the physician toward a treatment for which such side effects have not been observed during clinical trials.

[0799] Phenotypic randomization

[0800] In order to confirm the statistical significance of the first stage haplotype analysis described above, it might be suitable to perform further analyses in which genotyping data from case-control individuals are pooled and randomized with respect to the trait phenotype. The genotyping data of each individual are randomly allocated to one of two groups, which contain the same number of individuals as the case-control populations used to compile the data obtained in the first stage. A second stage haplotype analysis is preferably run on these artificial groups, preferably for the markers included in the haplotype of the first stage analysis showing the highest relative risk coefficient. This experiment is reiterated between 50 and 200 times, preferably between 75 and 125 times. The repeated iterations allow the determination of the percentage of obtained haplotypes with a significant p-value level below about 1×10⁻³.
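The randomization procedure can be sketched as a simple label-shuffling loop. This is an illustration under stated assumptions: for brevity the per-iteration analysis is a single-marker allelic chi square test rather than a full haplotype analysis, genotypes are encoded as two-character strings, and all names, the default of 100 iterations and the random seed are hypothetical choices for the example.

```python
import random
from math import erfc, sqrt


def allele_p_value(group_1, group_2, allele="A"):
    """Single-marker stand-in for the analysis rerun at each iteration (assumption)."""
    def counts(group):
        a = sum(g.count(allele) for g in group)
        return a, 2 * len(group) - a
    a1, b1 = counts(group_1)
    a2, b2 = counts(group_2)
    denom = (a1 + b1) * (a2 + b2) * (a1 + a2) * (b1 + b2)
    if denom == 0:
        return 1.0  # no variation to test
    chi2 = (a1 + b1 + a2 + b2) * (a1 * b2 - b1 * a2) ** 2 / denom
    return erfc(sqrt(chi2 / 2.0))  # chi-square with one degree of freedom


def randomization_fraction(cases, controls, p_value_fn=allele_p_value,
                           n_iter=100, threshold=1e-3, seed=0):
    """Fraction of random reallocations whose p-value falls below the threshold."""
    rng = random.Random(seed)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)               # artificial groups keep the original sizes
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)            # randomize with respect to the trait phenotype
        if p_value_fn(pooled[:n_cases], pooled[n_cases:]) < threshold:
            hits += 1
    return hits / n_iter


if __name__ == "__main__":
    cases = ["AA"] * 30 + ["AG"] * 15 + ["GG"] * 5
    controls = ["AA"] * 10 + ["AG"] * 20 + ["GG"] * 20
    print(randomization_fraction(cases, controls, n_iter=100))
```

A small returned fraction suggests that a p-value as low as the threshold is rarely produced by chance allocation of the same genotypes.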

EXAMPLE 24 Detailed Association Studies

[0801] The initial association studies between the 8p23 locus and prostate cancer described in Section I.D. were repeated at a higher level of sophistication.

[0802] Collection of DNA Samples from Affected and Non-Affected Individuals

[0803] Prostate cancer patients were recruited according to clinical inclusion criteria based on pathological or radical prostatectomy records as described above in Section I. However, the pool of individuals suffering from prostate cancer described in Section I was augmented from the original 185 individuals to a range of between 275 and 491 individuals depending on the marker tested. Similarly, the control pool of non-diseased individuals described in Section I was augmented from the original 104 individuals to a range of between 130 and 313 individuals depending on the marker tested.

[0804] Genotyping Affected and Control Individuals

[0805] As for Section I.D., allelic frequencies of the biallelic markers in each population were determined by performing microsequencing reactions on amplified fragments obtained by genomic PCR performed on the DNA samples from each individual as described in Example 5.

[0806] Association Studies

[0807] Association results were obtained using markers spanning a 650 kb region of the 8p23 locus around PG1, using both single point analysis and haplotyping studies. See FIG. 16. As compared with the earlier representation of the initial association results for this region shown in FIG. 2, FIG. 16 is to scale, since the entire region has now been sequenced. In addition, more markers were generated around the association peak in the area of PG1, each of which has been tested in single point analysis (hence the density of data within this subregion). The haplotyping curve in FIG. 16 represents, for each marker considered, the maximum p-value for haplotypes obtained using this marker and any number from all markers harbored by the same BAC and being in Hardy-Weinberg Disequilibrium with said marker.

[0808] The data presented in FIG. 16 show a strong association between this specific region within the 8p23 locus, especially in the area that has been identified as being the PG1 gene, and prostate cancer. The maximum p-value in single point analysis, for the PG1 subregion, is 3×10⁻³, while outside of the PG1 subregion, most of the p-values obtained for single point associations are less significant than 1×10⁻¹. The maximum p-value obtained for haplotyping studies is the one obtained for a marker inside PG1's BAC, and equals 3×10⁻⁶.

[0809] FIG. 17 is a graph showing an enlarged view of the single point association results within a 160 kb region comprising the PG1 gene. Markers involved in this enlargement were all located on BAC B0463F01 (see FIG. 16), except marker 4-14, which lies in very close proximity, on BAC B0189E08. FIG. 17 shows all of the markers which made up the maximum haplotype shown in FIG. 16. Some of these markers were later revealed to lie within the promoter, exonic or intronic regions of the PG1 gene. The markers outside the gene were all informative biallelic markers with a least frequent allele present at a frequency of more than 20%, while markers within the gene were a mix of such informative markers and markers whose least frequent allele's frequency is less than 20%. These data confirm and narrow the previous peak of association values seen in FIG. 16 to a 40 kb region harboring the PG1 gene. Significant associations are obtained for markers starting at the promoter site with marker No. 99-1485, and ending at the 3′ UTR site with marker No. 5-66.

[0810] FIG. 18A is a graph showing an enlarged view of the single point association results of 40 kb within the PG1 gene. These data confirm that seven markers within the PG1 gene have one allele associated with prostate cancer, with p-values all similar and more significant than 1×10⁻², specifically markers 99-622, 4-77, 4-71, 4-73, 99-598, 99-576 and 4-66. FIG. 18B is a table listing the location of markers within the PG1 gene and the two possible alleles at each site. For each marker, the disease-associated allele is indicated first; its frequencies in cases and controls as well as the difference between both are shown; the odds ratio and the p-value of each individual marker association are also shown.

[0811] The data in FIGS. 17, 18A, and 18B demonstrate that the markers in the PG1 gene have an association with prostate cancer that is valid, and exhibits similar significance values, regardless of whether the considered cases are sporadic or familial cases. Therefore, some PG1 alleles must be general risk factors for any type of prostate cancer, whether familial or sporadic. The fact that several p-values for associated alleles are around 1×10⁻² suggests that all these markers are in linkage disequilibrium with one another, and can all be used individually to assess PG1 associated prostate cancer susceptibility risk. The prostate cancer associated alleles of the 7 markers discussed above all exhibit an odds ratio of about 1.5, which means for each of them that an individual carrying such allele has 1.5 more chances to be susceptible to prostate cancer than not.

[0812] In order to confirm the significance of the association results found for markers on the BAC harboring PG1, a novel statistical method was performed as described in provisional patent application Ser. No. 60/107,986, filed November 10, 1998, the specification of which is incorporated herein.

[0813] Haplotype Analysis

[0814] The results of a haplotype analysis study using 4 markers (marker Nos. 4-14, 99-217, 4-66 and 99-221) within the 160 kb region shown in FIG. 17 are shown in FIG. 19A. These 4 markers have each been shown to be strongly associated with prostate cancer, i.e. with p-values more significant than 1×10⁻³ on approximately 150 cases and 130 controls. All haplotypes using 2, 3, or 4 markers among the 4 above cited were analyzed using 491 case patients and 317 control individuals. FIG. 19A shows the most significant haplotypes obtained, as well as the individual odds ratios for each. Haplotype 11 is the most significant (p-value of ca. 3×10⁻⁶), and is related to haplotype 5, shown in FIG. 4, in that three of the four marker alleles (4-14 C, 99-217 T and 99-221 A) are common to both haplotypes, and both cover a similar region. Differences in p-values are explained both by the addition of markers and of more case or control individuals. Haplotype 11 has a highly informative odds ratio (of above 3); it is present in 3% of the controls and almost 10% of the cases.

[0815] FIG. 19B is a table showing the segmented haplotyping results according to the age of the subjects, and whether the prostate cancer cases were sporadic or familial, using the same 4 markers and the same individuals as were used to generate the results in FIG. 19A. FIG. 19B shows equivalent results for all segments of the population analyzed, demonstrating that the PG1 associated alleles are general risk factors for prostate cancer, regardless of the age of onset of the disease.

[0816] The haplotyping results and odds ratios for all of the combinations of the 7 markers (99-622; 4-77; 4-71; 4-73; 99-598; 99-576; and 4-66) within the PG1 gene that were shown in FIG. 18 to have p-values more significant than 1×10⁻² were computed. A portion of these data is shown in FIG. 20. All of the 2-, 3-, 4-, 5-, 6- and 7-marker haplotypes were tested. FIG. 20 identifies, for each x-marker haplotype category, the most significant haplotype. Among all these, the most significant haplotype is the two-marker haplotype 1, which shows a p-value of approximately 6×10⁻⁵, with an odds ratio of 2. The frequency of haplotype 1 among the control individuals is 15%, while it is 26% among the case patients. It is worth noting that these frequencies are very similar for all haplotypes presented in FIG. 20. It will thus be sufficient to test this two-marker haplotype for prognosis/diagnosis on risk patients, as opposed to having a more complex test of a haplotype comprising 3 or more markers.
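For illustration, the odds ratios quoted above can be related to the case and control frequencies with the short sketch below. It assumes that the quoted percentages can be used directly as exposure proportions in cases and controls; the specification does not state how the values in FIGS. 19A and 20 were computed, and the function name is hypothetical.

```python
def odds_ratio(freq_cases, freq_controls):
    """Odds of carrying the haplotype among cases divided by the odds among controls."""
    if not (0 < freq_cases < 1 and 0 < freq_controls < 1):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    return (freq_cases / (1 - freq_cases)) / (freq_controls / (1 - freq_controls))


# Haplotype 1 of FIG. 20: 26% of cases, 15% of controls.
haplotype_1 = odds_ratio(0.26, 0.15)    # close to 2
# Haplotype 11 of FIG. 19A: almost 10% of cases, 3% of controls.
haplotype_11 = odds_ratio(0.10, 0.03)   # above 3
```

Under that assumption the computed values agree with the odds ratio of 2 reported for haplotype 1 and the odds ratio above 3 reported for haplotype 11.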

[0817] Finally, FIG. 21 is a graph showing the distribution of statistical significance, as measured by Chi-square values, for each series of possible x-marker haplotypes (x=2, 3 or 4) using all of the 19 markers found in the PG1 gene. These data confirm that testing 2-marker haplotypes within PG1 is sufficient because testing 3- or 4-marker haplotypes does not increase the statistical relevance of the analysis.
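To give a sense of the number of tests involved, the following sketch enumerates the marker subsets that underlie the x-marker haplotype series. It lists marker combinations only; each combination in turn admits several allele configurations, which the sketch does not enumerate. The helper names are hypothetical.

```python
from itertools import combinations

# The seven PG1 markers named in paragraph [0810].
PG1_MARKERS = ["99-622", "4-77", "4-71", "4-73", "99-598", "99-576", "4-66"]


def marker_sets(markers, sizes):
    """Yield every marker subset of each requested size."""
    for size in sizes:
        yield from combinations(markers, size)


def count_marker_sets(n_markers, sizes):
    """Number of marker subsets of each size drawn from n_markers markers."""
    return {size: sum(1 for _ in combinations(range(n_markers), size))
            for size in sizes}
```

For the seven markers, marker_sets(PG1_MARKERS, range(2, 8)) yields 120 subsets; for the 19 markers of FIG. 21, count_marker_sets(19, [2, 3, 4]) gives 171, 969 and 3876 subsets respectively.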

EXAMPLE 25 Attributable Risk

[0818] Attributable risk describes the proportion of individuals in a population exhibiting a phenotype due to exposure to a particular factor. For further discussion of attributable risk values, see Holland, Bart K., Probability without Equations: Concepts for Clinicians; The Johns Hopkins University Press, pp. 88-90. In the present case the phenotype examined was prostate cancer, and the exposure was either one single allele of an individual PG1-related marker, or a haplotype thereof in an individual's genome. The formula used for calculating attributable risk values in the present study was the following:

AR = P_(E)(RR - 1) / [P_(E)(RR - 1) + 1], where:

[0819] AR was the attributable risk of allele or haplotype;

[0820] P_(E) was the frequency of exposure to allele or haplotype within the population at large, in the present study a random male Caucasian population; and

[0821] RR was the relative risk; in the present study relative risk is approximated with the odds ratio, because of the relatively low incidence of prostate cancer in populations at large (values for the odds ratios are found in FIGS. 18B and 20).

[0822] In this case, P_(E) was estimated using a dominant transmission model for prostate cancer:

P_(E) = (N_(AA) + N_(AB)) / N, where:

[0823] N_(AA) was the number of homozygous individuals harboring the disease associated allele or haplotype within a given random population, and N_(AB) was the number of heterozygous individuals in said random population. N_(AA) and N_(AB) were calculated using the allele frequencies in the random population as indicated in FIGS. 18B and 20, and N was the number of individuals in the total random population.
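The two formulas above can be combined in a short sketch. The specification states only that N_(AA) and N_(AB) were calculated from allele frequencies; the use of Hardy-Weinberg proportions to do so (carriers = p² + 2p(1 - p)) is an assumption of this sketch, as are the function names.

```python
def exposure_frequency(allele_freq):
    """P_E under a dominant model: homozygous plus heterozygous carriers.

    Assumes Hardy-Weinberg proportions, so (N_AA + N_AB) / N = p^2 + 2p(1 - p).
    """
    p = allele_freq
    if not 0 <= p <= 1:
        raise ValueError("allele frequency must lie between 0 and 1")
    return p * p + 2 * p * (1 - p)


def attributable_risk(p_e, relative_risk):
    """AR = P_E (RR - 1) / [P_E (RR - 1) + 1]."""
    excess = p_e * (relative_risk - 1)
    return excess / (excess + 1)


# Example with the values quoted for haplotype 1 of FIG. 20:
# frequency of 15% in controls, odds ratio of 2 used in place of RR.
p_e = exposure_frequency(0.15)
ar = attributable_risk(p_e, 2.0)
```

With these inputs the sketch gives an exposure frequency of about 0.28 and an attributable risk of about 0.22.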

[0824] We calculated the attributable risks of disease-associated alleles for markers within the PG1 gene and presented these results in FIG. 18B. In FIG. 20, the attributable risk for the two-marker haplotypes present in the figure is shown as well. These data demonstrate that disease-associated alleles of PG1 are present in approximately 20% of prostate cancer patients in the Caucasian population at large, and therefore represent prognostic tools of significant value.

[0825] XII. Computer-Related Embodiments

[0826] As used herein the term “nucleic acid codes of the invention” encompasses the nucleotide sequences comprising, consisting essentially of, or consisting of any one of the following: a) a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, or 1000 nucleotides of SEQ ID No 179, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, or 25 of the following nucleotide positions of SEQ ID No 179: 1-2324, 2852-2936, 3204-3249, 3456-3572, 3899-4996, 5028-6086, 6310-8710, 9136-11170, 11534-12104, 12733-13163, 13206-14150, 14191-14302, 14338-14359, 14788-15589, 16050-16409, 16440-21718, 21959-22007, 22086-23057, 23488-23712, 23832-24099, 24165-24376, 24429-24568, 24607-25096, 25127-25269, 25300-27576, 27612-29217, 29415-30776, 30807-30986, 31628-32658, 32699-36324, 36772-39149, 39184-40269, 40580-40683, 40844-41048, 41271-43539, 43570-47024, 47510-48065, 48192-49692, 49723-50174, 52626-53599, 54516-55209, and 55666-56146; b) a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, or 1000 nucleotides of SEQ ID No 3 or the complements thereof, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, or 25 of the following nucleotide positions of SEQ ID No 3: 1-280, 651-690, 3315-4288, and 5176-5227; and c) a nucleotide sequence complementary to either one of the preceding nucleotide sequences.
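As an illustration of the two conditions recited above (a minimum span length and a minimum number of listed positions falling within the span), the following sketch checks a candidate span against the position ranges given for SEQ ID No 3. Positions are taken to be 1-based and inclusive, which is an assumption of the sketch, and the function names are hypothetical.

```python
# Position ranges recited for SEQ ID No 3 (assumed 1-based, inclusive).
SEQ_ID_3_RANGES = [(1, 280), (651, 690), (3315, 4288), (5176, 5227)]


def covered_positions(start, end, ranges):
    """Count listed positions that fall inside the span [start, end]."""
    total = 0
    for lo, hi in ranges:
        overlap = min(end, hi) - max(start, lo) + 1
        if overlap > 0:
            total += overlap
    return total


def span_qualifies(start, end, ranges, min_length=12, min_positions=1):
    """True if the span is long enough and includes enough listed positions."""
    length = end - start + 1
    return length >= min_length and covered_positions(start, end, ranges) >= min_positions
```

For example, a span running from position 270 to position 300 is 31 nucleotides long and includes 11 of the listed positions (270 through 280).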

[0827] The “nucleic acid codes of the invention” further encompass nucleotide sequences homologous to: a) a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, or 1000 nucleotides of SEQ ID No 179, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, or 25 of the following nucleotide positions of SEQ ID No 179: 1-2324, 2852-2936, 3204-3249, 3456-3572, 3899-4996, 5028-6086, 6310-8710, 9136-11170, 11534-12104, 12733-13163, 13206-14150, 14191-14302, 14338-14359, 14788-15589, 16050-16409, 16440-21718, 21959-22007, 22086-23057, 23488-23712, 23832-24099, 24165-24376, 24429-24568, 24607-25096, 25127-25269, 25300-27576, 27612-29217, 29415-30776, 30807-30986, 31628-32658, 32699-36324, 36772-39149, 39184-40269, 40580-40683, 40844-41048, 41271-43539, 43570-47024, 47510-48065, 48192-49692, 49723-50174, 52626-53599, 54516-55209, and 55666-56146; b) a contiguous span of at least 12, 15, 18, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, 500, or 1000 nucleotides of SEQ ID No 3 or the complements thereof, wherein said contiguous span comprises at least 1, 2, 3, 5, 10, or 25 of the following nucleotide positions of SEQ ID No 3: 1-280, 651-690, 3315-4288, and 5176-5227; and, c) sequences complementary to all of the preceding sequences. Homologous sequences refer to sequences having at least 99%, 98%, 97%, 96%, 95%, 90%, 85%, 80%, or 75% homology to these contiguous spans. Homology may be determined using any method described herein, including BLAST2N with the default parameters or with any modified parameters. Homologous sequences also may include RNA sequences in which uridines replace the thymines in the nucleic acid codes of the invention. It will be appreciated that the nucleic acid codes of the invention can be represented in the traditional single character format (See the inside back cover of Stryer, Lubert. Biochemistry, 3rd edition. W. H. Freeman & Co., New York.) or in any other format or code which records the identity of the nucleotides in a sequence.

[0828] As used herein the term “polypeptide codes of the invention” encompasses the polypeptide sequences comprising a contiguous span of at least 6, 8, 10, 12, 15, 20, 25, 30, 40, 50, or 100 amino acids of SEQ ID No 4, wherein said contiguous span includes at least 1, 2, 3, or 5 of the amino acid positions 1-26, 295-302, and 333-353. It will be appreciated that the polypeptide codes of the invention can be represented in the traditional single character format or three letter format (See the inside back cover of Stryer, Lubert. Biochemistry, 3rd edition. W. H. Freeman & Co., New York.) or in any other format or code which records the identity of the polypeptides in a sequence.

[0829] It will be appreciated by those skilled in the art that the nucleic acid codes of the invention and polypeptide codes of the invention can be stored, recorded, and manipulated on any medium which can be read and accessed by a computer. As used herein, the words “recorded” and “stored” refer to a process for storing information on a computer medium. A skilled artisan can readily adopt any of the presently known methods for recording information on a computer readable medium to generate manufactures comprising one or more of the nucleic acid codes of the invention, or one or more of the polypeptide codes of the invention. Another aspect of the present invention is a computer readable medium having recorded thereon at least 2, 5, 10, 15, 20, 25, 30, or 50 nucleic acid codes of the invention. Another aspect of the present invention is a computer readable medium having recorded thereon at least 2, 5, 10, 15, 20, 25, 30, or 50 polypeptide codes of the invention.

[0830] Computer readable media include magnetically readable media, optically readable media, electronically readable media and magnetic/optical media. For example, the computer readable media may be a hard disk, a floppy disk, a magnetic tape, CD-ROM, Digital Versatile Disk (DVD), Random Access Memory (RAM), or Read Only Memory (ROM) as well as other types of media known to those skilled in the art.

[0831] Embodiments of the present invention include systems, particularly computer systems which store and manipulate the sequence information described herein. One example of a computer system 100 is illustrated in block diagram form in FIG. 22. As used herein, “a computer system” refers to the hardware components, software components, and data storage components used to analyze the nucleotide sequences of the nucleic acid codes of the invention or the amino acid sequences of the polypeptide codes of the invention. In one embodiment, the computer system 100 is a Sun Enterprise 1000 server (Sun Microsystems, Palo Alto, Calif.). The computer system 100 preferably includes a processor for processing, accessing and manipulating the sequence data. The processor 105 can be any well-known type of central processing unit, such as the Pentium III from Intel Corporation, or similar processor from Sun, Motorola, Compaq or International Business Machines.

[0832] Preferably, the computer system 100 is a general purpose system that comprises the processor 105 and one or more internal data storage components 110 for storing data, and one or more data retrieving devices for retrieving the data stored on the data storage components. A skilled artisan can readily appreciate that any one of the currently available computer systems is suitable.

[0833] In one particular embodiment, the computer system 100 includes a processor 105 connected to a bus which is connected to a main memory 115 (preferably implemented as RAM) and one or more internal data storage devices 110, such as a hard drive and/or other computer readable media having data recorded thereon. In some embodiments, the computer system 100 further includes one or more data retrieving devices 118 for reading the data stored on the internal data storage devices 110.

[0834] The data retrieving device 118 may represent, for example, a floppy disk drive, a compact disk drive, a magnetic tape drive, etc. In some embodiments, the internal data storage device 110 is a removable computer readable medium such as a floppy disk, a compact disk, a magnetic tape, etc. containing control logic and/or data recorded thereon. The computer system 100 may advantageously include or be programmed by appropriate software for reading the control logic and/or the data from the data storage component once inserted in the data retrieving device.

[0835] The computer system 100 includes a display 120 which is used to display output to a computer user. It should also be noted that the computer system 100 can be linked to other computer systems 125a-c in a network or wide area network to provide centralized access to the computer system 100.

[0836] Software for accessing and processing the nucleotide sequences of the nucleic acid codes of the invention or the amino acid sequences of the polypeptide codes of the invention (such as search tools, compare tools, and modeling tools etc.) may reside in main memory 115 during execution.

[0837] In some embodiments, the computer system 100 may further comprise a sequence comparer for comparing the above-described nucleic acid codes of the invention or the polypeptide codes of the invention stored on a computer readable medium to reference nucleotide or polypeptide sequences stored on a computer readable medium. A “sequence comparer” refers to one or more programs which are implemented on the computer system 100 to compare a nucleotide or polypeptide sequence with other nucleotide or polypeptide sequences and/or compounds including but not limited to peptides, peptidomimetics, and chemicals stored within the data storage means. For example, the sequence comparer may compare the nucleotide sequences of nucleic acid codes of the invention or the amino acid sequences of the polypeptide codes of the invention stored on a computer readable medium to reference sequences stored on a computer readable medium to identify homologies, motifs implicated in biological function, or structural motifs. The various sequence comparer programs identified elsewhere in this patent specification are particularly contemplated for use in this aspect of the invention.

[0838] FIG. 23 is a flow diagram illustrating one embodiment of a process 200 for comparing a new nucleotide or protein sequence with a database of sequences in order to determine the homology levels between the new sequence and the sequences in the database. The database of sequences can be a private database stored within the computer system 100, or a public database such as GENBANK, PIR or SWISSPROT that is available through the Internet.

[0839] The process 200 begins at a start state 201 and then moves to a state 202 wherein the new sequence to be compared is stored to a memory in a computer system 100. As discussed above, the memory could be any type of memory, including RAM or an internal storage device.

[0840] The process 200 then moves to a state 204 wherein a database of sequences is opened for analysis and comparison. The process 200 then moves to a state 206 wherein the first sequence stored in the database is read into a memory on the computer. A comparison is then performed at a state 210 to determine if the first sequence is the same as the new sequence. It is important to note that this step is not limited to performing an exact comparison between the new sequence and the first sequence in the database. Methods are well known to those of skill in the art for comparing two nucleotide or protein sequences, even if they are not identical. For example, gaps can be introduced into one sequence in order to raise the homology level between the two tested sequences. The parameters that control whether gaps or other features are introduced into a sequence during comparison are normally entered by the user of the computer system.

[0841] Once a comparison of the two sequences has been performed at the state 210, a determination is made at a decision state 212 whether the two sequences are the same. Of course, the term “same” is not limited to sequences that are absolutely identical. Sequences that are within the homology parameters entered by the user will be marked as “same” in the process 200.

[0842] If a determination is made that the two sequences are the same, the process 200 moves to a state 214 wherein the name of the sequence from the database is displayed to the user. This state notifies the user that the sequence with the displayed name fulfills the homology constraints that were entered. Once the name of the stored sequence is displayed to the user, the process 200 moves to a decision state 218 wherein a determination is made whether more sequences exist in the database. If no more sequences exist in the database, then the process 200 terminates at an end state 220. However, if more sequences do exist in the database, then the process 200 moves to a state 224 wherein a pointer is moved to the next sequence in the database so that it can be compared to the new sequence. In this manner, the new sequence is aligned and compared with every sequence in the database.

[0843] It should be noted that if a determination had been made at the decision state 212 that the sequences were not homologous, then the process 200 would move immediately to the decision state 218 in order to determine if any other sequences were available in the database for comparison.
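The loop of process 200 can be sketched as follows, for illustration only. The database is modeled as an in-memory dictionary mapping names to sequences, and an ungapped position-by-position identity fraction stands in for whatever alignment method and homology parameters the user would select; both are assumptions of the sketch and the function names are hypothetical.

```python
def identity_fraction(a, b):
    """Ungapped stand-in for an alignment score: matching positions / longer length."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def search_database(new_sequence, database, min_identity=0.9):
    """Return the names of stored sequences that count as the "same" as new_sequence.

    database maps sequence names to sequences (assumed in-memory layout).
    """
    hits = []
    for name, stored in database.items():            # states 206 and 224
        score = identity_fraction(new_sequence, stored)   # state 210
        if score >= min_identity:                    # decision state 212
            hits.append(name)                        # state 214
    return hits                                      # end state 220


# Hypothetical three-entry database.
example_db = {
    "seqA": "ATGCATGCAT",
    "seqB": "ATGCATGCAA",
    "seqC": "TTTTTTTTTT",
}
matches = search_database("ATGCATGCAT", example_db, min_identity=0.9)
```

A production comparer would replace identity_fraction with a gapped alignment program such as those named elsewhere in this specification.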

[0844] Accordingly, one aspect of the present invention is a computer system comprising a processor, a data storage device having stored thereon a nucleic acid code of the invention or a polypeptide code of the invention, a data storage device having retrievably stored thereon reference nucleotide sequences or polypeptide sequences to be compared to the nucleic acid code of the invention or polypeptide code of the invention and a sequence comparer for conducting the comparison. The sequence comparer may indicate a homology level between the sequences compared or identify structural motifs in the nucleic acid code of the invention and polypeptide codes of the invention or it may identify structural motifs in sequences which are compared to these nucleic acid codes and polypeptide codes. In some embodiments, the data storage device may have stored thereon the sequences of at least 2, 5, 10, 15, 20, 25, 30, or 50 of the nucleic acid codes of the invention or polypeptide codes of the invention.

[0845] Another aspect of the present invention is a method for determining the level of homology between a nucleic acid code of the invention and a reference nucleotide sequence, comprising the steps of reading the nucleic acid code and the reference nucleotide sequence through the use of a computer program which determines homology levels and determining homology between the nucleic acid code and the reference nucleotide sequence with the computer program. The computer program may be any of a number of computer programs for determining homology levels, including those specifically enumerated herein, including BLAST2N with the default parameters or with any modified parameters. The method may be implemented using the computer systems described above. The method may also be performed by reading 2, 5, 10, 15, 20, 25, 30, or 50 of the above described nucleic acid codes of the invention through the use of the computer program and determining homology between the nucleic acid codes and reference nucleotide sequences.

[0846] FIG. 24 is a flow diagram illustrating one embodiment of a process 250 in a computer for determining whether two sequences are homologous. The process 250 begins at a start state 252 and then moves to a state 254 wherein a first sequence to be compared is stored to a memory. The second sequence to be compared is then stored to a memory at a state 256. The process 250 then moves to a state 260 wherein the first character in the first sequence is read and then to a state 262 wherein the first character of the second sequence is read. It should be understood that if the sequence is a nucleotide sequence, then the character would normally be either A, T, C, G or U. If the sequence is a protein sequence, then it should be in the single letter amino acid code so that the first and second sequences can be easily compared.

[0847] A determination is then made at a decision state 264 whether the two characters are the same. If they are the same, then the process 250 moves to a state 268 wherein the next characters in the first and second sequences are read. A determination is then made whether the next characters are the same. If they are, then the process 250 continues this loop until two characters are not the same. If a determination is made that the next two characters are not the same, the process 250 moves to a decision state 274 to determine whether there are any more characters in either sequence to read.

[0848] If there are no more characters to read, then the process 250 moves to a state 276 wherein the level of homology between the first and second sequences is displayed to the user. The level of homology is determined by calculating the proportion of characters between the sequences that were the same out of the total number of characters in the first sequence. Thus, if every character in a first 100 nucleotide sequence aligned with every character in a second sequence, the homology level would be 100%.
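The homology level computed at state 276 can be illustrated with the following sketch. It compares the two sequences position by position without introducing gaps and divides the number of identical characters by the length of the first sequence, as described above; the function name is hypothetical and a gapped alignment is outside the scope of the sketch.

```python
def homology_level(first, second):
    """Percentage of positions in `first` whose character matches `second`.

    Positions are compared in order without gaps; positions of `first`
    that have no counterpart in `second` count as mismatches.
    """
    if not first:
        raise ValueError("the first sequence must not be empty")
    same = sum(1 for a, b in zip(first, second) if a == b)
    return 100.0 * same / len(first)
```

For instance, homology_level("ATGC", "ATGA") evaluates to 75.0, since three of the four characters of the first sequence are matched.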

[0849] Alternatively, the computer program may be a computer program which compares the nucleotide sequences of the nucleic acid codes of the present invention to reference nucleotide sequences in order to determine whether the nucleic acid code of the invention differs from a reference nucleic acid sequence at one or more positions. Optionally such a program records the length and identity of inserted, deleted or substituted nucleotides with respect to the sequence of either the reference polynucleotide or the nucleic acid code of the invention. In one embodiment, the computer program may be a program which determines whether the nucleotide sequences of the nucleic acid codes of the invention contain one or more single nucleotide polymorphisms (SNP) with respect to a reference nucleotide sequence. These single nucleotide polymorphisms may each comprise a single base substitution, insertion, or deletion.
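A minimal sketch of such a difference-finding program is given below for the simplest case only: two sequences of equal length that are already aligned, so that every difference is a single base substitution. Detecting insertions and deletions would require an alignment step that the sketch deliberately omits. The function name and the 1-based position convention are assumptions.

```python
def substitutions(code, reference):
    """List single base substitutions between two pre-aligned, equal-length sequences.

    Returns tuples of (1-based position, reference base, observed base).
    """
    if len(code) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + 1, ref_base, observed)
        for index, (observed, ref_base) in enumerate(zip(code, reference))
        if observed != ref_base
    ]
```

As an example, substitutions("ATGCA", "ATGTA") reports a single difference at position 4, where the reference T is replaced by C.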

[0850] Another aspect of the present invention is a method for determining the level of homology between a polypeptide code of the invention and a reference polypeptide sequence, comprising the steps of reading the polypeptide code of the invention and the reference polypeptide sequence through use of a computer program which determines homology levels and determining homology between the polypeptide code and the reference polypeptide sequence using the computer program.

[0851] Accordingly, another aspect of the present invention is a method for determining whether a nucleic acid code of the invention differs at one or more nucleotides from a reference nucleotide sequence comprising the steps of reading the nucleic acid code and the reference nucleotide sequence through use of a computer program which identifies differences between nucleic acid sequences and identifying differences between the nucleic acid code and the reference nucleotide sequence with the computer program. In some embodiments, the computer program is a program which identifies single nucleotide polymorphisms. The method may be implemented by the computer systems described above and the method illustrated in FIG. 24. The method may also be performed by reading at least 2, 5, 10, 15, 20, 25, 30, or 50 of the nucleic acid codes of the invention and the reference nucleotide sequences through the use of the computer program and identifying differences between the nucleic acid codes and the reference nucleotide sequences with the computer program.

[0852] In other embodiments the computer based system may further comprise an identifier for identifying features within the nucleotide sequences of the nucleic acid codes of the invention or the amino acid sequences of the polypeptide codes of the invention.

[0853] An “identifier” refers to one or more programs which identify certain features within the above-described nucleotide sequences of the nucleic acid codes of the invention or the amino acid sequences of the polypeptide codes of the invention. In one embodiment, the identifier may comprise a program which identifies an open reading frame in the cDNA codes of the invention.
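By way of a hedged example, a very simple open reading frame identifier could look like the sketch below. It scans the forward strand only, starts at each ATG, and reads codons in frame until the first stop codon (TAA, TAG or TGA). The minimum length parameter, the function name and the 0-based half-open coordinates are assumptions of the sketch; ATG codons nested in frame inside a longer open reading frame are reported as separate entries.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end) pairs of forward-strand ORFs, 0-based, end exclusive.

    min_codons counts the codons before the stop codon, including the ATG.
    """
    orfs = []
    for start in range(len(sequence) - 2):
        if sequence[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon until a stop codon is met.
        for pos in range(start + 3, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs
```

Applied to the hypothetical sequence "CCATGAAATAGGG", the sketch reports one open reading frame spanning "ATGAAATAG".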

[0854] FIG. 25 is a flow diagram illustrating one embodiment of an identifier process 300 for detecting the presence of a feature in a sequence. The process 300 begins at a start state 302 and then moves to a state 304 wherein a first sequence that is to be checked for features is stored to a memory 115 in the computer system 100. The process 300 then moves to a state 306 wherein a database of sequence features is opened. Such a database would include a list of each feature's attributes along with the name of the feature. For example, a feature name could be “Initiation Codon” and the attribute would be “ATG”. Another example would be the feature name “TAATAA Box” and the feature attribute would be “TAATAA”. An example of such a database is produced by the University of Wisconsin Genetics Computer Group (www.gcg.com).

[0855] Once the database of features is opened at the state 306, the process 300 moves to a state 308 wherein the first feature is read from the database. A comparison of the attribute of the first feature with the first sequence is then made at a state 310. A determination is then made at a decision state 316 whether the attribute of the feature was found in the first sequence. If the attribute was found, then the process 300 moves to a state 318 wherein the name of the found feature is displayed to the user.

[0856] The process 300 then moves to a decision state 320 wherein a determination is made whether more features exist in the database. If no more features exist, then the process 300 terminates at an end state 324. However, if more features do exist in the database, then the process 300 reads the next sequence feature at a state 326 and loops back to the state 310 wherein the attribute of the next feature is compared against the first sequence.

[0857] It should be noted that if the feature attribute is not found in the first sequence at the decision state 316, the process 300 moves directly to the decision state 320 in order to determine if any more features exist in the database.
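The identifier process 300 can be illustrated with the sketch below. The feature database is modeled as a dictionary holding the two example entries named in paragraph [0854]; a real feature database would be far larger, and the dictionary layout, the function name and the use of exact substring matching are assumptions of the sketch.

```python
# Feature name -> attribute, using the two examples from paragraph [0854].
FEATURES = {
    "Initiation Codon": "ATG",
    "TAATAA Box": "TAATAA",
}


def find_features(sequence, features):
    """Map each feature found in the sequence to the 0-based index of its first occurrence."""
    found = {}
    for name, attribute in features.items():   # states 308 and 326
        index = sequence.find(attribute)        # state 310
        if index != -1:                         # decision state 316
            found[name] = index                 # state 318
    return found                                # end state 324
```

For the hypothetical sequence "GGTAATAACCATGA", the sketch reports the TAATAA Box at index 2 and the Initiation Codon at index 10.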

[0858] In another embodiment, the identifier may comprise a molecular modeling program which determines the 3-dimensional structure of the polypeptide codes of the invention. In some embodiments, the molecular modeling program identifies target sequences that are most compatible with profiles representing the structural environments of the residues in known three-dimensional protein structures. (See, e.g., Eisenberg et al., U.S. Pat. No. 5,436,850 issued Jul. 25, 1995). In another technique, the known three-dimensional structures of proteins in a given family are superimposed to define the structurally conserved regions in that family. This protein modeling technique also uses the known three-dimensional structure of a homologous protein to approximate the structure of the polypeptide codes of the invention. (See e.g., Srinivasan, et al., U.S. Pat. No. 5,557,535 issued Sep. 17, 1996). Conventional homology modeling techniques have been used routinely to build models of proteases and antibodies. (Sowdhamini et al., Protein Engineering 10:207, 215 (1997)). Comparative approaches can also be used to develop three-dimensional protein models when the protein of interest has poor sequence identity to template proteins. In some cases, proteins fold into similar three-dimensional structures despite having very weak sequence identities. For example, the three-dimensional structures of a number of helical cytokines fold in similar three-dimensional topology in spite of weak sequence homology.

[0859] The recent development of threading methods now enables the identification of likely folding patterns in a number of situations where the structural relatedness between target and template(s) is not detectable at the sequence level. In hybrid methods, fold recognition is performed using Multiple Sequence Threading (MST), structural equivalencies are deduced from the threading output using a distance geometry program DRAGON to construct a low resolution model, and a full-atom representation is constructed using a molecular modeling package such as QUANTA.

[0860] According to this 3-step approach, candidate templates are first identified by using the novel fold recognition algorithm MST, which is capable of performing simultaneous threading of multiple aligned sequences onto one or more 3-D structures. In a second step, the structural equivalencies obtained from the MST output are converted into interresidue distance restraints and fed into the distance geometry program DRAGON, together with auxiliary information obtained from secondary structure predictions. The program combines the restraints in an unbiased manner and rapidly generates a large number of low resolution model conformations. In a third step, these low resolution model conformations are converted into full-atom models and subjected to energy minimization using the molecular modeling package QUANTA. (See, e.g., Aszódi et al., Proteins: Structure, Function, and Genetics, Supplement 1:3842 (1997)).

[0861] The results of the molecular modeling analysis may then be used in rational drug design techniques to identify agents which modulate the activity of the polypeptide codes of the invention.

[0862] Accordingly, another aspect of the present invention is a method of identifying a feature within the nucleic acid codes of the invention or the polypeptide codes of the invention comprising reading the nucleic acid code(s) or the polypeptide code(s) through the use of a computer program which identifies features therein and identifying features within the nucleic acid code(s) or polypeptide code(s) with the computer program. In one embodiment, the computer program comprises a computer program which identifies open reading frames. In a further embodiment, the computer program identifies structural motifs in a polypeptide sequence. In another embodiment, the computer program comprises a molecular modeling program. The method may be performed by reading a single sequence or at least 2, 5, 10, 15, 20, 25, 30, or 50 of the nucleic acid codes of the invention or the polypeptide codes of the invention through the use of the computer program and identifying features within the nucleic acid codes or polypeptide codes with the computer program.
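As an illustration of a computer program which identifies open reading frames, the following minimal sketch scans the three forward reading frames of a nucleic acid code. It is offered under stated assumptions and is not the program of the invention: an open reading frame is assumed to run from an ATG codon to the first in-frame stop codon, only the given strand is examined, and the names find_orfs and min_codons are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) 0-based, half-open coordinates of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon, with the
    stop codon included in the reported span. min_codons counts the codons
    from ATG up to, but not including, the stop codon.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence one complete codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


orfs = find_orfs("CCATGAAATTTTAGGATGCCCTGA")
for start, end in orfs:
    print(f"ORF at {start}-{end}")
```

To cover both strands, the same function could be applied to the reverse complement of the sequence, with the coordinates mapped back accordingly.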

[0863] The nucleic acid codes of the invention or the polypeptide codes of the invention may be stored and manipulated in a variety of data processor programs in a variety of formats. For example, they may be stored as text in a word processing file, such as Microsoft WORD or WORDPERFECT, or as an ASCII file in a variety of database programs familiar to those of skill in the art, such as DB2, SYBASE, or ORACLE. In addition, many computer programs and databases may be used as sequence comparers, identifiers, or sources of reference nucleotide or polypeptide sequences to be compared to the nucleic acid codes of the invention or the polypeptide codes of the invention. The following list is intended not to limit the invention but to provide guidance to programs and databases which are useful with the nucleic acid codes of the invention or the polypeptide codes of the invention. The programs and databases which may be used include, but are not limited to: MacPattern (EMBL), DiscoveryBase (Molecular Applications Group), GeneMine (Molecular Applications Group), Look (Molecular Applications Group), MacLook (Molecular Applications Group), BLAST and BLAST2 (NCBI), BLASTN and BLASTX (Altschul et al., 1990, J. Mol. Biol. 215(3):403-410), FASTA (Pearson and Lipman, 1988, Proc. Natl. Acad. Sci. USA 85(8):2444-2448), FASTDB (Brutlag et al., Comp. App. Biosci. 6:237-245, 1990), Catalyst (Molecular Simulations Inc.), Catalyst/SHAPE (Molecular Simulations Inc.), Cerius².DBAccess (Molecular Simulations Inc.), HypoGen (Molecular Simulations Inc.), Insight II (Molecular Simulations Inc.), Discover (Molecular Simulations Inc.), CHARMm (Molecular Simulations Inc.), Felix (Molecular Simulations Inc.), DelPhi (Molecular Simulations Inc.), QuanteMM (Molecular Simulations Inc.), Homology (Molecular Simulations Inc.), Modeler (Molecular Simulations Inc.), ISIS (Molecular Simulations Inc.), Quanta/Protein Design (Molecular Simulations Inc.), WebLab (Molecular Simulations Inc.), WebLab Diversity Explorer (Molecular Simulations Inc.), Gene Explorer (Molecular Simulations Inc.), SeqFold (Molecular Simulations Inc.), the EMBL/Swissprotein database, the MDL Available Chemicals Directory database, the MDL Drug Data Report database, the Comprehensive Medicinal Chemistry database, Derwent's World Drug Index database, the BioByteMasterFile database, the Genbank database, and the Genseqn database. Many other programs and databases would be apparent to one of skill in the art given the present disclosure.

[0864] Motifs which may be detected using the above programs include sequences encoding leucine zippers, helix-turn-helix motifs, glycosylation sites, ubiquitination sites, alpha helices, and beta sheets, signal sequences encoding signal peptides which direct the secretion of the encoded proteins, sequences implicated in transcription regulation such as homeoboxes, acidic stretches, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites.
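Two of the motifs listed above, N-linked glycosylation sites and leucine zippers, can be located in a polypeptide code by simple pattern matching, as the following minimal sketch suggests. The patterns are assumptions drawn from commonly used consensus definitions of the kind catalogued in PROSITE (N-{P}-[ST]-{P} and L-x(6)-L-x(6)-L-x(6)-L) and are not taken from this disclosure; the names MOTIF_PATTERNS and scan_motifs and the example sequence are hypothetical.

```python
import re
from typing import Dict, List

# Assumed consensus patterns, written as regular expressions.
MOTIF_PATTERNS: Dict[str, str] = {
    "N-glycosylation site": r"N[^P][ST][^P]",
    "leucine zipper": r"L.{6}L.{6}L.{6}L",
}


def scan_motifs(protein: str) -> Dict[str, List[int]]:
    """Return the 0-based start positions of each motif in a protein sequence."""
    seq = protein.upper()
    hits = {}
    for name, pattern in MOTIF_PATTERNS.items():
        # A lookahead is used so that overlapping occurrences are all reported.
        regex = re.compile(f"(?=({pattern}))")
        hits[name] = [match.start() for match in regex.finditer(seq)]
    return hits


# Hypothetical example sequence, not a PG1 sequence.
example = "MKNGSALAEEAAKLEEKAAELKKAEEALQNPTA"
motif_hits = scan_motifs(example)
for name, positions in motif_hits.items():
    print(name, positions)
```

A pattern match of this kind only flags candidate sites; whether a given site is functional would need to be established by other means.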

[0865] Although this invention has been described in terms of certain preferred embodiments, other embodiments which will be apparent to those of ordinary skill in the art in view of the disclosure herein are also within the scope of this invention. Accordingly, the scope of the invention is intended to be defined only by reference to the appended claims. All documents cited herein are incorporated herein by reference in their entirety.

What is claimed is:
 1. A recombinant, purified or isolated polynucleotide comprising a mammalian PG1 gene, cDNA, complement thereof, or fragment thereof having at least 10 nucleotides in length.
 2. The polynucleotide according to claim 1, wherein said mammalian PG1 gene or cDNA is human or mouse.
 3. The polynucleotide according to claim 2, wherein the polynucleotide is selected from SEQ ID NOs: 3, 69, 112-124, 179, and 182-184.
 4. A polynucleotide selected from SEQ ID NOs: 185-578.
 5. A purified or isolated polypeptide comprising a mammalian PG1 protein, or fragment thereof having at least 8 amino acids in length.
 6. The polypeptide according to claim 5, wherein said mammalian PG1 protein is human or mouse.
 7. The polypeptide according to claim 6, wherein said polypeptide is selected from SEQ ID NOs: 4, 5, 70, 74, and 125-136.
 8. The polypeptide according to claim 5, wherein said polypeptide consists of said mammalian PG1 protein, or fragment thereof having at least 8 amino acids in length.
 9. A polynucleotide comprising a nucleic acid sequence encoding a polypeptide according to claim 8.
 10. An antibody composition capable of selectively binding to an epitope-containing fragment of a polypeptide according to claim 8, wherein said antibody is either polyclonal or monoclonal.
 11. A vector comprising a polynucleotide according to any one of claims 1, 4, and 9.
 12. A host cell comprising a polynucleotide according to claim 11.
 13. A nonhuman host animal or mammal comprising a vector according to claim 11.
 14. A mammalian host cell comprising a PG1 gene disrupted by homologous recombination with a knock out vector.
 15. A nonhuman host mammal comprising a PG1 gene disrupted by homologous recombination with a knock out vector.
 16. A polynucleotide according to any one of claims 1, 4, and 9, further comprising a label.
 17. A polynucleotide according to any one of claims 1, 4, and 9, attached to a solid support.
 18. A random or addressable array of polynucleotides comprising at least one polynucleotide according to any one of claims 1, 4, and 9.
 19. A method of determining whether an individual is at risk of developing cancer or prostate cancer, or whether said individual suffers from cancer or prostate cancer as a result of a mutation in the PG1 gene comprising: obtaining a nucleic acid sample from said individual; and determining whether the nucleotides present at one or more PG1-related biallelic markers are indicative of a risk of developing cancer or prostate cancer or indicative of cancer or prostate cancer resulting from a mutation in the PG1 gene.
 20. A method of determining whether an individual is at risk of developing cancer or prostate cancer or whether said individual suffers from cancer or prostate cancer as a result of a mutation in the PG1 gene comprising: obtaining a nucleic acid sample from said individual; and determining whether the nucleotides present at one or more PG1-related biallelic markers are indicative of a risk of developing cancer or prostate cancer or indicative of cancer or prostate cancer resulting from a mutation in the PG1 gene.
 21. A method according to either one of claims 19 and 20, wherein said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66.
 22. A method of obtaining an allele of the PG1 gene which is associated with a detectable phenotype comprising: obtaining a nucleic acid sample from an individual expressing said detectable phenotype; contacting said nucleic acid sample with an agent capable of specifically detecting a nucleic acid encoding the PG1 protein; and isolating said nucleic acid encoding the PG1 protein.
 23. A method of obtaining an allele of the PG1 gene which is associated with a detectable phenotype comprising: obtaining a nucleic acid sample from an individual expressing said detectable phenotype; contacting said nucleic acid sample with an agent capable of specifically detecting a sequence within the 8p23 region of the human genome; identifying a nucleic acid encoding the PG1 protein in said nucleic acid sample; and isolating said nucleic acid encoding the PG1 protein.
 24. A method of categorizing the risk of prostate cancer in an individual comprising the step of assaying a sample taken from the individual to determine whether the individual carries an allelic variant of PG1 associated with an increased risk of prostate cancer.
 25. The method of claim 24 wherein said sample is a nucleic acid sample.
 26. The method of claim 24 wherein said sample is a protein sample.
 27. The method of claim 26, further comprising determining whether the PG1 protein in said sample binds an antibody that binds specifically to a PG1 isoform associated with prostate cancer.
 28. A method of genotyping comprising determining the identity of a nucleotide at a PG1-related biallelic marker in a biological sample.
 29. A method of estimating the frequency of an allele in a population comprising determining the proportional representation of a nucleotide at a PG1-related biallelic marker in a pooled biological sample derived from said population.
 30. A method of detecting an association between a genotype and a phenotype, comprising the steps of: a) genotyping at least one PG1-related biallelic marker in a trait positive population; b) genotyping said PG1-related biallelic marker in a control population; and c) determining whether a statistically significant association exists between said genotype and said phenotype.
 31. A method of estimating the frequency of a haplotype for a set of biallelic markers in a population, comprising: a) genotyping at least one PG1-related biallelic marker; b) genotyping a second biallelic marker by determining the identity of the nucleotides at said second biallelic marker for both copies of said second biallelic marker present in the genome of each individual in said population; and c) applying an haplotype determination method to the identities of the nucleotides determined in steps a) and b) to obtain an estimate of said frequency.
 32. A method of detecting an association between a haplotype and a phenotype, comprising the steps of: a) estimating the frequency of at least one haplotype in a trait positive population according to the method of claim 31; b) estimating the frequency of said haplotype in a control population according to the method of claim 31; and c) determining whether a statistically significant association exists between said haplotype and said phenotype.
 33. A method according to claim 31, wherein said PG1-related biallelic marker and said second biallelic marker are 4-77/151 and 4-66/145.
 34. A method according to claim 32, wherein said haplotype exhibits a p-value of < 1×10⁻³ in an association with a trait positive population with cancer or prostate cancer.
 35. A method according to any one of claims 29 to 31, wherein said PG1-related biallelic marker is a PG1-related biallelic marker positioned in SEQ ID NO: 179; a PG1-related biallelic marker selected from the group consisting of 99-1485/251, 99-622/95, 99-619/141, 4-76/222, 4-77/151, 4-71/233, 4-72/127, 4-73/134, 99-610/250, 99-609/225, 4-90/283, 99-602/258, 99-600/492, 99-598/130, 99-217/277, 99-576/421, 4-61/269, 4-66/145, and 4-67/40; or a PG1-related biallelic marker selected from the group consisting of 99-622, 4-77, 4-71, 4-73, 99-598, 99-576, and 4-66.
 36. A method according to either one of claims 30 and 32, wherein said control population is a trait negative population or a random population.
 37. A method according to any one of claims 22, 23, 30, and 32, wherein said phenotype is a disease, cancer or prostate cancer; a response to an anti-cancer agent or an anti-prostate cancer agent; or a side effect to an anti-cancer or anti-prostate cancer agent.
 38. An isolated, purified, or recombinant polynucleotide comprising a contiguous span of at least 12 nucleotides of SEQ ID No 179 or the complements thereof, wherein said contiguous span comprises at least 1 of the following nucleotide positions of SEQ ID No 179: 1-2324, 2852-2936, 3204-3249, 3456-3572, 3899-4996, 5028-6086, 6310-8710, 9136-11170, 11534-12104, 12733-13163, 13206-14150, 14191-14302, 14338-14359, 14788-15589, 16050-16409, 16440-21718, 21959-22007, 22086-23057, 23488-23712, 23832-24099, 24165-24376, 24429-24568, 24607-25096, 25127-25269, 25300-27576, 27612-29217, 29415-30776, 30807-30986, 31628-32658, 32699-36324, 36772-39149, 39184-40269, 40580-40683, 40844-41048, 41271-43539, 43570-47024, 47510-48065, 48192-49692, 49723-50174, 52626-53599, 54516-55209, and 55666-56146.
 39. An isolated, purified, or recombinant polynucleotide comprising a contiguous span of at least 12 nucleotides of SEQ ID No 3 or the complements thereof, wherein said contiguous span comprises at least 1 of the following nucleotide positions of SEQ ID No 3: 1-280, 651-690, 3315-4288, and 5176-5227.
 40. An isolated, purified, or recombinant polynucleotide which encodes a polypeptide comprising a contiguous span of at least 8 amino acids of SEQ ID No 4, wherein said contiguous span includes at least 1 of the amino acid positions 1-26, 295-302, and 333-353.
 41. An isolated, purified, or recombinant polypeptide comprising a contiguous span of at least 8 amino acids of SEQ ID No 4, wherein said contiguous span includes at least 1 of the amino acid positions 1-26, 295-302, and 333-353.
 42. An isolated or purified antibody composition capable of selectively binding to an epitope-containing fragment of a polypeptide according to claim 55, wherein said epitope comprises at least 1 of the amino acid positions 1-26, 295-302, and 333-353.
 43. A computer readable medium having stored thereon a sequence selected from the group consisting of a nucleic acid code comprising one of the following: a) a contiguous span of at least 12 nucleotides of SEQ ID No 179, wherein said contiguous span comprises at least 1 of the following nucleotide positions of SEQ ID No 179: 1-2324, 2852-2936, 3204-3249, 3456-3572, 3899-4996, 5028-6086, 6310-8710, 9136-11170, 11534-12104, 12733-13163, 13206-14150, 14191-14302, 14338-14359, 14788-15589, 16050-16409, 16440-21718, 21959-22007, 22086-23057, 23488-23712, 23832-24099, 24165-24376, 24429-24568, 24607-25096, 25127-25269, 25300-27576, 27612-29217, 29415-30776, 30807-30986, 31628-32658, 32699-36324, 36772-39149, 39184-40269, 40580-40683, 40844-41048, 41271-43539, 43570-47024, 47510-48065, 48192-49692, 49723-50174, 52626-53599, 54516-55209, and 55666-56146; b) a contiguous span of at least 12 nucleotides of SEQ ID No 3 or the complements thereof, wherein said contiguous span comprises at least 1 of the following nucleotide positions of SEQ ID No 3: 1-280, 651-690, 3315-4288, and 5176-5227; and c) a nucleotide sequence complementary to either one of the preceding nucleotide sequences.
 44. A computer readable medium having stored thereon a sequence consisting of a polypeptide code comprising a contiguous span of at least 8 amino acids of SEQ ID No 4, wherein said contiguous span includes at least 1 of the amino acid positions 1-26, 295-302, and 333-353.
 45. A computer system comprising a processor and a data storage device wherein said data storage device comprises a computer readable medium according to claim 43 or 44.
 46. A computer system according to claim 45, further comprising a sequence comparer and a data storage device having reference sequences stored thereon.
 47. A computer system of claim 46 wherein said sequence comparer comprises a computer program which indicates polymorphisms.
 48. A computer system of claim 45 further comprising an identifier which identifies features in said sequence.
 49. A method for comparing a first sequence to a reference sequence, comprising the steps of: reading said first sequence and said reference sequence through use of a computer program which compares sequences; and determining differences between said first sequence and said reference sequence with said computer program, wherein said first sequence is selected from the group consisting of a nucleic acid code comprising one of the following: a) a contiguous span of at least 12 nucleotides of SEQ ID No 179, wherein said contiguous span comprises at least 1 of the following nucleotide positions of SEQ ID No 179: 1-2324, 2852-2936, 3204-3249, 3456-3572, 3899-4996, 5028-6086, 6310-8710, 9136-11170, 11534-12104, 12733-13163, 13206-14150, 14191-14302, 14338-14359, 14788-15589, 16050-16409, 16440-21718, 21959-22007, 22086-23057, 23488-23712, 23832-24099, 24165-24376, 24429-24568, 24607-25096, 25127-25269, 25300-27576, 27612-29217, 29415-30776, 30807-30986, 31628-32658, 32699-36324, 36772-39149, 39184-40269, 40580-40683, 40844-41048, 41271-43539, 43570-47024, 47510-48065, 48192-49692, 49723-50174, 52626-53599, 54516-55209, and 55666-56146; b) a contiguous span of at least 12 nucleotides of SEQ ID No 3 or the complements thereof, wherein said contiguous span comprises at least 1 of the following nucleotide positions of SEQ ID No 3: 1-280, 651-690, 3315-4288, and 5176-5227; c) a nucleotide sequence complementary to either one of the preceding nucleotide sequences; and d) a polypeptide code comprising a contiguous span of at least 8 amino acids of SEQ ID No 4, wherein said contiguous span includes at least 1 of the amino acid positions 1-26, 295-302, and 333-353. 